Logistic Regression Classification | Math

A deep insight into logistic regression including the math involved in it.

Hello Guys!!!

So far, we have discussed the Regression technique. Now, let us look at the second important Supervised learning technique, i.e. Classification.

Can classification problems be solved using Linear Regression? Why is Logistic Regression, a classification technique, still named regression? These questions, along with a detailed look at Logistic Regression, will be addressed in this blog.

Introduction to Classification

Classification is a subcategory of supervised learning in which a given set of data is sorted into classes, such as Good/Bad (reviews), Spam/Not spam (emails) or Positive/Negative/Neutral (tweets). It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as targets, labels, or categories.

The classification predictive modeling is the task of approximating the mapping function from input variables to discrete output variables. The main goal is to identify which class/category the new data will fall into.

There are two main types of classification problems:

  • Binary classification: The typical example is e-mail spam detection, in which each e-mail either is spam (→ 1) or isn’t (→ 0).
  • Multi-class classification: Like handwritten digit recognition (where classes go from 0 to 9).

Let’s try to understand this with an example.

Whenever we get an email, the spam classifier classifies it. The classifier uses Natural Language Processing (will be covered in future blogs) with ML to process the text of the mail. It looks at the types of words and sentences used, and then decides whether that mail is spam or not.

Real-life application of Classification:

  • Speech Recognition
  • Spam classification
  • Face detection
  • Handwriting recognition

Types Of Learners In Classification

  • Lazy Learners: Lazy learners simply store the training data and wait until testing data appears. Classification is done using the most closely related data in the stored training data. They take more time to predict than eager learners. E.g. k-nearest neighbor, case-based reasoning.
  • Eager Learners: Eager learners construct a classification model from the given training data before getting data for predictions. They must commit to a single hypothesis that will work for the entire input space. Due to this, they take a lot of time in training and less time for a prediction. E.g. Decision Tree, Naïve Bayes, Artificial Neural Networks.

Now we know the basics of classification. But before we start Logistic Regression, I would like to answer: why can’t we use linear regression for classification?

To understand this, we will look at the following example:

Let’s say we create a perfectly balanced dataset that contains a list of customers and a label telling us whether each customer made a purchase. In the dataset, there are 20 customers: 10 customers aged 10 to 19 who purchased, and 10 customers aged 20 to 29 who did not purchase. “Purchased” is a binary label denoted by 0 and 1, where 0 denotes “customer did not make a purchase” and 1 denotes “customer made a purchase”.
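
Here is a minimal sketch of such a dataset in NumPy. The exact ages are an assumption (only the age ranges are given), so it simply uses one customer per age from 10 to 29. The names `ages` and `purchased` are made up for illustration.

```python
import numpy as np

# Assumption: one customer per age; ages 10..19 purchased, ages 20..29 did not.
ages = np.arange(10, 30)               # 20 customers, ages 10, 11, ..., 29
purchased = (ages < 20).astype(int)    # 1 = made a purchase, 0 = did not

print(len(ages), "customers,", purchased.sum(), "purchases")
```

Half of the labels are 1 and half are 0, which is what "perfectly balanced" means here.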

Data Point on Graph

The objective of a linear regression model is to find a relationship between the input variables and a target variable. Below is our linear regression model that was trained using the above dataset. The red line is the best fit line for the training dataset, which aims to minimize the distance between the predicted value and the actual value.

Linear Regression PLOT

The first problem with Linear Regression: It outputs continuous answers, whereas we need just two answers, ‘0’ or ‘1’. Since it’s just binary classification, we can work around this problem with a threshold: if Y is greater than 0.5 (above the green line), predict that this customer will make a purchase; otherwise, predict that they will not. But this won’t work for a multi-class classifier.
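
To see the continuous outputs and the 0.5 rule in action, here is a small sketch with scikit-learn's `LinearRegression`. It assumes one customer per age from 10 to 29 (the exact ages are not given), and the variable names are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed toy data: ages 10..19 purchased (1), ages 20..29 did not (0).
ages = np.arange(10, 30).reshape(-1, 1)
purchased = (ages.ravel() < 20).astype(int)

linreg = LinearRegression().fit(ages, purchased)

scores = linreg.predict(ages)            # continuous values, not classes
labels = (scores > 0.5).astype(int)      # the 0.5 threshold ("green line") rule

print("lowest score:", scores.min(), "highest score:", scores.max())
print("all customers classified correctly:", (labels == purchased).all())
```

Notice that the raw scores go below 0 and above 1, so they can't be read as probabilities. The threshold is what turns them into the two answers we need.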

Second problem “The effect of outliers”:

Let’s add 10 more customers aged 60 to 70, and train our linear regression model again, finding the new best fit line.

Relation Line changed by adding more DATA

Our linear regression model manages to fit a new line, but if you look closer, the outcomes for some customers (age 20 to 22) are now predicted wrongly. Linear regression fits the line by minimizing the overall prediction error, so to reduce the distance between predicted and actual values for the customers aged 60 to 70, it tilts the line, and that shifts where the line crosses 0.5. Logistic Regression handles this situation much better.
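
The outlier effect can be reproduced with a few lines of code. This is only a sketch under stated assumptions: one customer per age from 10 to 29, ten extra customers aged 61 to 70, and the extra customers labeled as "did not purchase" (the label of the new customers is not stated above).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed toy data: original 20 customers plus 10 older ones who did not purchase.
ages = np.concatenate([np.arange(10, 30), np.arange(61, 71)])
purchased = np.concatenate([np.ones(10), np.zeros(10), np.zeros(10)]).astype(int)

X = ages.reshape(-1, 1)
linreg = LinearRegression().fit(X, purchased)

# Same rule as before: predict "purchase" when the line is above 0.5.
labels = (linreg.predict(X) > 0.5).astype(int)

wrong_ages = ages[labels != purchased]
print("misclassified ages:", wrong_ages)
```

On this toy data the misclassified customers are the ones aged 20, 21 and 22. The new points flatten the line, so it now crosses 0.5 at a higher age than before.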

Logistic Regression

Logistic Regression on above problem

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.
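
As a rough illustration, here is the same kind of customer data passed to scikit-learn's `LogisticRegression`. The data itself is an assumption (one customer per age, with the older customers labeled as "did not purchase"), and `max_iter=1000` is just a generous setting chosen for this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed toy data: ages 10..19 purchased, ages 20..29 and 61..70 did not.
ages = np.concatenate([np.arange(10, 30), np.arange(61, 71)]).reshape(-1, 1)
purchased = np.concatenate([np.ones(10), np.zeros(20)]).astype(int)

logreg = LogisticRegression(max_iter=1000).fit(ages, purchased)

probs = logreg.predict_proba(ages)[:, 1]   # P(purchased = 1) for each customer
labels = logreg.predict(ages)              # probabilities mapped to class 0 or 1

print("P(purchase) for the youngest customer:", probs[0])
print("P(purchase) for the oldest customer:", probs[-1])
```

Unlike the straight line from linear regression, every output here stays between 0 and 1, and you can compare `labels` with `purchased` to see how the classifier copes with the older customers.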

Binary Logistic Regression

Here we predict one of two values, like yes/no. Our previous example, whether the customer will purchase the product or not, is an example of binary logistic regression.

Making Prediction:

Let’s use the same multiple linear regression equation from our linear regression tutorial.

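P is obtained by applying the sigmoid function to that output: P = 1 / (1 + e^(−z)), where z is the output of the linear equation.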
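
A minimal NumPy sketch of this prediction step could look like the following. The weights, bias and inputs are hypothetical values picked only to show the shape of the computation.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, weights, bias):
    """Linear regression equation first, then the sigmoid on top of it."""
    z = X @ weights + bias
    return sigmoid(z)

# Hypothetical inputs and parameters, for illustration only.
X = np.array([[0.0, 0.0], [2.0, 1.0], [-2.0, -1.0]])
weights = np.array([1.0, 0.5])
bias = 0.0

P = predict_proba(X, weights, bias)
print(P)
```

A linear output of 0 maps to a probability of exactly 0.5, positive outputs map above 0.5 and negative outputs map below it.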

Cost Function

We won’t use the same cost function, MSE, as we did for linear regression, because here the prediction function is non-linear due to the sigmoid transformation. With MSE, the cost surface becomes non-convex and can have many local minima, so gradient descent may not find the global minimum.
Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for y=1 and one for y=0.

COST Function

The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost.

The key thing to note is the cost function penalizes confident and wrong predictions more than it rewards confident and right predictions!

Above functions compressed into one

Compressed COST function
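
In code, the single-expression version is usually written as the average of -(y·log(p) + (1 - y)·log(1 - p)). Here is a minimal NumPy sketch of that standard form. The clipping constant is an assumption added to avoid log(0), and the example probabilities are made up.

```python
import numpy as np

def log_loss(y_true, p):
    """Cross-entropy for binary labels y_true and predicted probabilities p."""
    p = np.clip(p, 1e-15, 1 - 1e-15)     # keep log() away from 0
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 1, 0, 0])

# Made-up predictions: very sure and right vs. very sure and wrong.
confident_right = log_loss(y, np.array([0.99, 0.99, 0.01, 0.01]))
confident_wrong = log_loss(y, np.array([0.01, 0.01, 0.99, 0.99]))

print("confident and right:", confident_right)
print("confident and wrong:", confident_wrong)
```

When y = 1 only the first term survives, and when y = 0 only the second one does, which is how the two separate cost functions collapse into one. The confident and wrong case ends up with a far larger cost than the confident and right case.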

Gradient descent

To minimize our cost, we use Gradient Descent just like before in Linear Regression. There are other more sophisticated optimization algorithms out there, such as conjugate gradient or BFGS (a quasi-Newton method), but you don’t have to worry about these. Machine learning libraries like Scikit-learn hide their implementations so you can focus on more interesting things!
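
If you are curious what the library is hiding, here is a bare-bones sketch of gradient descent for logistic regression. It relies on the standard result that the gradient of the cross-entropy cost with respect to the weights is Xᵀ(p − y) / m. The learning rate, iteration count and toy data are arbitrary choices for illustration, not tuned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    """Plain batch gradient descent on the cross-entropy cost."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)               # current predicted probabilities
        error = p - y                        # prediction minus label
        w -= lr * (X.T @ error) / n_samples  # step against the weight gradient
        b -= lr * error.mean()               # step against the bias gradient
    return w, b

# Tiny made-up dataset: negative inputs are class 0, positive inputs are class 1.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

w, b = fit_logistic(X, y)
preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print("weights:", w, "bias:", b, "predictions:", preds)
```

The update has the same form as in linear regression. The only difference is that the prediction passes through the sigmoid first.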

Because the underlying technique is much the same as in Linear “Regression”, this algorithm is also named Logistic “Regression”.

Multiclass logistic regression

Instead of y = 0, 1, we will expand our definition so that y = 0, 1, …, n. We re-run binary classification multiple times, once for each class.

Procedure

  • Divide the problem into n+1 binary classification problems (+1 because the index starts at 0).
  • Predict the probability the observations are in that single class.
  • prediction = max(probability of the classes)

For each sub-problem, we select one class (YES) and lump all the others into a second class (NO). Then we take the class with the highest predicted value. A closely related way to handle all the classes at once is the SOFTMAX function.
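
The one-class-against-the-rest procedure can be sketched like this. It trains one scikit-learn `LogisticRegression` per class on a tiny made-up 2D dataset. The data and variable names are assumptions for illustration, and this sketch follows the YES/NO procedure rather than SoftMax.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: three small clusters, one per class.
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5],
              [10, 0], [10, 1], [11, 0]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
classes = np.unique(y)

# One column of probabilities per binary sub-problem.
probs = np.zeros((len(X), len(classes)))
for k in classes:
    binary_target = (y == k).astype(int)          # this class = YES, all others = NO
    model = LogisticRegression().fit(X, binary_target)
    probs[:, k] = model.predict_proba(X)[:, 1]    # P(observation is in class k)

# prediction = the class whose binary model is most confident
prediction = classes[np.argmax(probs, axis=1)]
print(prediction)
```

Each row of `probs` holds one probability per sub-problem. Since the binary models are trained separately, those values do not have to add up to 1.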

SoftMax regression is a form of logistic regression that normalizes an input vector of scores into a vector of values that follows a probability distribution whose total sums up to 1. The output values are in the range [0,1], which is nice because we are no longer limited to binary classification and can accommodate as many classes as we need. SoftMax is also widely used as the output layer of artificial neural networks. Logistic regression with SoftMax is also known as multinomial logistic regression. The standard (unit) SoftMax function is defined by the formula:

SoftMax Intuition

In other words: we apply the standard exponential function to each element z_i of the input vector z and normalize these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector σ(z) is 1.
There is great depth in this function, but we have the scikit-learn Machine Learning library to do all this in a few simple steps.
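
Before handing things over to the library, here is what the SoftMax description boils down to in NumPy. This is a minimal sketch: the input scores are made up, and subtracting the maximum is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Exponentiate each score, then divide by the sum of all exponentials."""
    shifted = z - np.max(z)      # same output, but avoids overflow in exp()
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up scores for three classes.
z = np.array([2.0, 1.0, 0.1])
probs = softmax(z)

print(probs, "sum =", probs.sum())
```

The largest score gets the largest probability, and the outputs always add up to 1. With scikit-learn you don't have to write any of this yourself.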

# same code for all types of classifier
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# making a model
logreg = LogisticRegression()
# training our model (x_train, y_train come from your own train/test split)
logreg.fit(x_train, y_train)
# predict output of model
test_y_hat = logreg.predict(x_test)
# test the model (these error metrics assume numeric class labels such as 0/1)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_hat - y_test)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_hat - y_test) ** 2))
# fraction of test samples classified correctly
print("Accuracy: %.2f" % accuracy_score(y_test, test_y_hat))

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing. In simple language, the cells where the predicted and actual values coincide, i.e. TP (true positives) & TN (true negatives), are the correctly predicted values.
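
To make the terminology concrete, here is a small sketch that pulls the four counts out of scikit-learn's `confusion_matrix` for a binary problem. The true and predicted labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = positive class, 0 = negative class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Correct predictions are the ones where predicted and actual values coincide.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
print("accuracy:", accuracy)
```

Here 3 positives and 3 negatives are predicted correctly, with 1 false positive and 1 false negative, so the accuracy is 6 / 8 = 0.75.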

Confusion Matrix
# plotting the confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, test_y_hat)
sns.heatmap(cnf_matrix, annot=True)
plt.title("Confusion Matrix")
plt.show()

This brings us to the end of this blog. I hope you liked it and understood most of it. Good BYE, Have a Great Day!

Learning and exploring this beautiful world with amazing tech.