K-Nearest Neighbor (KNN)
A complete understanding of KNN in simple language, with mathematical intuition and a Python implementation.
K-Nearest Neighbor is one of the easiest, most beginner-friendly algorithms to learn: a few key points and the picture is complete. One caveat worth noting up front: KNN compares every new point against the whole stored training set, so it scales poorly to very large datasets. Important points related to it:
- KNN is one of the simplest Machine Learning algorithms.
- It is based on the Supervised Learning technique.
- KNN finds similar data points according to their properties. Picture a scatter plot in which red, blue, and green points form clusters: KNN groups them by similarity.
- The k-NN algorithm stores all the available data and classifies a new data point based on similarity. Suppose a new, unlabeled (black) point is added to that plot. If we draw a small circle C1 around it, blue points are closest, so black is classified as blue; if we draw a bigger circle C2, red points dominate, and black is classified as red. A runnable sketch of exactly this effect follows this list.
- The K-NN algorithm can be used for Regression as well as Classification, but it is mostly used for Classification problems.
- K-NN is a non-parametric algorithm, i.e. it makes no assumption that the underlying data distribution can be described by a finite set of parameters.
- It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs its computation only at classification time.
- In other words, at the training phase the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to it.
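To make the circle C1/C2 idea concrete, here is a minimal sketch on made-up 2-D points (the coordinates and labels are invented for illustration, not taken from any real figure): with a small k the black point's nearest neighbors are blue, while a larger k pulls in the more numerous red points.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented data: three "blue" points near the query, five "red" points farther out
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.1], [2.2, 2.0], [2.0, 2.2]])
y = np.array(["blue"] * 3 + ["red"] * 5)
black = np.array([[1.4, 1.3]])  # the new, unlabeled point

for k in (3, 7):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k}: black -> {clf.predict(black)[0]}")  # blue for k=3, red for k=7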
Working of KNN
As discussed earlier, KNN uses feature similarity to predict values for new data points: a new point is assigned a category on the basis of its nearest (minimum-distance) neighbors. The following steps are performed to implement the algorithm:
- First, choose the value of K, i.e. how many nearest data points to consult. The value of K really matters: as in the example above, a small k classifies the black point as blue, while a larger k classifies it as red. Values around 5–8 are a common starting point, but there is no universally best choice, so re-run your code with different k values and keep the one that validates best.
- For each point in the test data, do the following:
- −> Calculate the distance between the test point and each row of the training data, using a metric such as Euclidean, Manhattan, or Hamming distance. Euclidean distance is the most commonly used.
- −> Sort the distances in ascending order.
- −> Choose the top K rows (the K nearest neighbors) from the sorted array.
- −> Assign the test point the most frequent class among those K rows (these steps are sketched in code just below).
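The four steps translate almost line-for-line into code. The sketch below is a from-scratch illustration (the function name and signature are invented for this article, not a library API):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 1: Euclidean distance from the new point to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Steps 2-3: sort distances in ascending order, keep indices of the top K rows
    nearest = np.argsort(distances)[:k]
    # Step 4: assign the most frequent class among those K neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny usage example on invented data
print(knn_predict(np.array([[0, 0], [1, 1], [5, 5]]),
                  np.array(["a", "a", "b"]),
                  np.array([0.5, 0.5]), k=2))  # -> "a"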
We know that KNN is used for both regression and classification. But what is the difference between KNN regression and KNN classification?
- KNN regression predicts the output as the local average (mean) of the nearest neighbors' target values.
- A KNN classifier predicts the class as the most frequent class (mode) among the nearest neighbors, which can also be expressed as class probabilities. Both modes are contrasted in the sketch below.
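A tiny sketch on invented one-dimensional data makes the contrast concrete:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Regression: the prediction is the mean target of the k nearest neighbors
reg = KNeighborsRegressor(n_neighbors=3).fit(X, [1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
print(reg.predict([[2.0]]))        # neighbors X=1,2,3 -> mean 2.0

# Classification: the prediction is the modal (most frequent) class
clf = KNeighborsClassifier(n_neighbors=3).fit(X, ["low", "low", "low", "high", "high", "high"])
print(clf.predict([[2.0]]))        # neighbors X=1,2,3 -> "low"
print(clf.predict_proba([[2.0]]))  # fraction of neighbors per class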
Main problems faced with KNN:
KNN doesn't perform well with a large number of features, so feature engineering becomes an important part of the workflow when you choose the KNN algorithm. To deal with the curse of dimensionality, you can apply principal component analysis (PCA) to the features before fitting KNN, or use a feature selection approach; one possible pipeline is sketched below.
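One way to wire this up in scikit-learn is a pipeline that standardizes the features and projects them with PCA before KNN. This is a sketch, with the bundled iris data standing in for a real high-dimensional dataset:

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration
pipe = make_pipeline(
    StandardScaler(),       # KNN is distance-based, so feature scaling matters
    PCA(n_components=2),    # keep only the top principal components
    KNeighborsClassifier(n_neighbors=5),
)
print(cross_val_score(pipe, X, y, cv=5).mean())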
KNN Implementation with Python
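The walkthrough below trains a KNN classifier for every k from 2 to 20 and compares training and test scores. The dataset itself is interchangeable: scikit-learn's bundled iris data stands in here, and you can swap in your own pd.read_csv(...) call where indicated.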
#Import all important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#Read the dataset (iris is a stand-in here -- replace it with your own data,
#e.g. pd.read_csv(...), plus the appropriate feature/target columns)
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#Train the MODEL for every k from 2 to 20
K = []
training = []
test = []
scores = {}

for k in range(2, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    training_score = clf.score(X_train, y_train)
    test_score = clf.score(X_test, y_test)
    K.append(k)
    training.append(training_score)
    test.append(test_score)
    scores[k] = [training_score, test_score]

#Evaluating the model
for keys, values in scores.items():
    print(keys, ':', values)

#Plotting the training scores graph
ax = sns.stripplot(x=K, y=training)
ax.set(xlabel='values of k', ylabel='Training Score')
plt.show()

#Plotting the testing scores graph
ax = sns.stripplot(x=K, y=test)
ax.set(xlabel='values of k', ylabel='Test Score')
plt.show()

#Plotting both training and testing graphs together
plt.scatter(K, training, color='k')
plt.scatter(K, test, color='g')
plt.show()
A Quick Summary of the KNN Algorithm
- K is a positive integer; as a rule of thumb, values around 2–5 suit small datasets and 5–8 suit relatively large ones.
- Iteratively change K to find the best results, ideally with cross-validation rather than a single split (see the sketch below).
- KNN makes predictions using the similarity between an input sample and each training instance.
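Instead of re-running the script by hand for each K, the search can be automated with cross-validation. A minimal sketch (iris again as a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(2, 21))},
                      cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))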