Support Vector Machine | Classifier
SVM finds a suitable Hyperplane. How? Let’s go through it.
Support Vector Machine, or SVM, is one of the most popular Supervised Learning algorithms, used for Classification as well as Regression problems. Primarily, however, it is used for Classification problems in Machine Learning. Here, we will look at SVM as a classification algorithm.
In my previous blogs, I have already covered other classification and regression algorithms. Make sure you read those first to get the most out of this one.
SVMs are implemented in a unique way compared to other machine learning algorithms, and lately they have become extremely popular because of their ability to handle multiple continuous and categorical variables.
So you’re working on a text classification problem. You’re refining your training data, and maybe you’ve even tried stuff out using Naïve Bayes. But now you’re feeling confident in your dataset, and want to take it one step further. Enter Support Vector Machines (SVM): a fast and dependable classification algorithm that performs very well with a limited amount of data.
How does SVM work?
In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that differentiates the two classes very well.
But first, let's understand some important terms used in defining an SVM. Please refer to the image above for a better understanding.
Hyperplane
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
Support Vectors
Data points that are closest to the hyperplane are called support vectors. Support vectors influence the position and orientation of the hyperplane; using these support vectors, we maximize the margin of the classifier, and deleting them would change the position of the hyperplane. These are the points that help us build our SVM.
Margin
It may be defined as the gap between the two lines drawn through the closest data points of different classes. It can be calculated as the perpendicular distance from the separating line to the support vectors. A large margin is considered a good margin, and a small margin is considered a bad margin.
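To make these terms concrete, here is a minimal sketch (a toy dataset and illustrative settings, not a fixed recipe) that fits a linear SVM with scikit-learn and reads the hyperplane, the support vectors, and the margin width back off the fitted model:
# A minimal, illustrative sketch: fit a linear SVM on toy 2-D data and inspect
# the hyperplane, the support vectors, and the margin width
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=6)  # toy 2-feature data
clf = SVC(kernel='linear', C=1000)  # a large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]         # hyperplane: w . x + b = 0
print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", 2 / np.linalg.norm(w))  # distance between the two margin lines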
The GOAL of the SVM algorithm is to create the best line or decision boundary (a hyperplane) that can segregate n-dimensional space into classes, so that we can easily place new data points in the correct category in the future.
Let’s look at the possible hyperplanes:
Fig-1 vs. Fig-2: which plane is better? Look at the scenarios above and try to identify the best one. Maximizing the distance (known as the margin) between the nearest data points (of either class) and the hyperplane helps us decide on the right hyperplane.
In order to maximize the margin, we need to minimize the loss function.
The cost (hinge loss) is 0 if the predicted and actual values have the same sign and the point lies outside the margin; if not, we calculate a loss value. We also add a regularization parameter to the cost function; its objective is to balance margin maximization against the loss.
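As a rough sketch of that idea (not the exact optimization solved by library implementations), the regularized hinge loss can be written in a few lines of NumPy; the function name svm_cost and the parameter lambda_reg are just illustrative:
# Illustrative regularized hinge loss; labels y are assumed to be +1 / -1
import numpy as np

def svm_cost(w, b, X, y, lambda_reg=0.01):
    margins = y * (X.dot(w) + b)               # signed score scaled by the label
    hinge = np.maximum(0, 1 - margins).mean()  # 0 for points safely on the correct side
    return hinge + lambda_reg * np.dot(w, w)   # loss plus the regularization term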
In Fig-2, you can see that the margin for hyperplane C is high compared to both A and B. Hence, we choose C as the right hyperplane. Another compelling reason for selecting the hyperplane with a higher margin is robustness: if we select a hyperplane with a low margin, there is a high chance of misclassification.
Let’s look at another scenario:
One point at the other end acts like an outlier for classification. The SVM algorithm has a feature that allows it to ignore outliers and still find the hyperplane with the maximum margin. Hence, we can say that SVM classification is robust to outliers.
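In scikit-learn, this tolerance is controlled by the regularization parameter C of SVC: a smaller C gives a softer margin that is more willing to ignore stray points, while a larger C tries harder to classify every training point correctly. A small, purely illustrative comparison:
# Illustrative: C controls how strongly margin violations (e.g. outliers) are penalized
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)  # overlapping toy data

soft = SVC(kernel='linear', C=0.01).fit(X, y)  # soft margin, tolerates outliers
hard = SVC(kernel='linear', C=100).fit(X, y)   # harder margin, bends towards outliers

print("Support vectors (C=0.01):", len(soft.support_vectors_))
print("Support vectors (C=100): ", len(hard.support_vectors_))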
Kernel Trick
In the SVM classifier, it is easy to have a linear hyperplane between two classes. But a burning question arises: do we need to add such a feature manually to obtain a hyperplane? No, the SVM algorithm has a technique called the kernel trick. An SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem. It is mostly useful in non-linear separation problems. Simply put, it performs some extremely complex data transformations and then finds the process to separate the data based on the labels or outputs you have defined.
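To see the effect of the kernel trick, here is a hedged sketch on scikit-learn's make_circles dataset, where the two classes cannot be separated by a straight line; an RBF kernel typically scores much higher than a linear one here (exact numbers will vary):
# Illustrative: RBF kernel vs. linear kernel on data that is not linearly separable
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_clf = SVC(kernel='linear').fit(X_train, y_train)
rbf_clf = SVC(kernel='rbf').fit(X_train, y_train)

print("Linear kernel accuracy:", linear_clf.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_clf.score(X_test, y_test))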
Python Code for SVM
# Import libraries
import warnings
import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score

warnings.filterwarnings('ignore')

# Create a Linear SVM object
support = svm.LinearSVC(random_state=20)

# Train the model using the training set and check the score on the test set
# (train_x, train_y, test_x, test_y are assumed to have been prepared beforehand)
support.fit(train_x, train_y)
predicted = support.predict(test_x)
score = accuracy_score(test_y, predicted)
How to tune the Parameters of SVM?
SVM Hyperparameter Tuning using GridSearchCV
GridSearchCV takes a dictionary that describes the parameters that could be tried on a model to train it. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested.
# Defining the parameter grid (you can also try adding more values for better performance)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# Fitting the model for grid search
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)
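Once the search has finished, the best parameter combination and the refitted model can be read from the grid object; a short follow-up (X_test and y_test are assumed to exist):
# Inspect the result of the grid search (X_test / y_test assumed to be prepared)
print(grid.best_params_)           # best combination of C, gamma and kernel found
print(grid.best_estimator_)        # the SVC refitted on the full training set
print(grid.score(X_test, y_test))  # accuracy of the best model on the test set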
Pros:
- It works really well when there is a clear margin of separation.
- It is effective in high-dimensional spaces.
- It is effective in cases where the number of dimensions is greater than the number of samples.
- It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Cons:
- It doesn't perform well when we have a large data set, because the required training time is higher.
- It also doesn't perform very well when the data set has more noise, i.e. the target classes overlap.
- SVM doesn't directly provide probability estimates; these are calculated using an expensive five-fold cross-validation, which is available through the SVC class of the Python scikit-learn library (see the short sketch below).
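As a small illustration of that last point, enabling probability=True on SVC makes predict_proba available, at the cost of the internal cross-validation mentioned above (the toy dataset is just for demonstration):
# Illustrative: probability estimates require probability=True, which slows down fitting
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=1)
clf = SVC(kernel='rbf', probability=True).fit(X, y)  # internal 5-fold CV + Platt scaling
print(clf.predict_proba(X[:3]))                      # class probabilities for three points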
Cross-validation is useful for checking whether the model generalizes to unseen data. In the k-fold cross-validation method, all the entries in the original training data set are used for both training and validation, and each entry is used for validation exactly once.
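A quick way to run such a k-fold check in scikit-learn is cross_val_score; a minimal sketch with five folds on a toy dataset:
# Illustrative: 5-fold cross-validation of an SVM classifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=2)
scores = cross_val_score(SVC(kernel='rbf'), X, y, cv=5)  # one accuracy score per fold
print(scores, scores.mean())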
Conclusion
I hope this blog post helped you understand SVMs. Leave your thoughts, feedback, or suggestions in the comments below. Goodbye, and have a great day.