K-Means Clustering | Data Science | ML (Part 8)
K-Means is a commonly used unsupervised machine learning algorithm that forms clusters automatically.
Learning from the clusters formed automatically in space:
As the image above shows, groups of stars together form clusters; I marked as many clusters as possible for better understanding. The image makes it clear that grouping similar things forms a cluster. K-Means clustering does the same with data: it forms clusters by grouping similar data points. In this blog, I will try to explain the topic in the easiest way.
First of all, can you guess whether K-Means is supervised or unsupervised learning?
I answered this in the very first line: the clusters are formed automatically. The algorithm tries to learn from the data by itself and gives us the desired output, so it is undoubtedly an unsupervised learning algorithm. Now we are ready to begin learning K-Means, starting with clustering itself.
ALGORITHM
Clustering is the process of grouping data based on observed patterns, i.e. forming clusters on the basis of the similarity of the data. It is a natural way of understanding a given dataset: by observing how similar its data points are to one another.
We are given a dataset with certain features and values for those features (like a vector for each item). The task is to categorize the items by forming clusters: the algorithm groups the items into k clusters of similar items. To measure that similarity, we use the Euclidean distance.
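As a quick illustration (a tiny sketch with made-up feature vectors, not from any real dataset):

import numpy as np

# two hypothetical items, each described by three feature values
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean distance: square root of the sum of squared feature differences
print(np.linalg.norm(a - b))  # sqrt(9 + 4 + 0) ≈ 3.61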
Data points within a cluster must be as similar (i.e. near each other) as possible, and data points of different clusters must be as different (i.e. far apart) as possible.
To achieve this, we use the Euclidean distance. First, we place a random centroid in each cluster. Then…
- Calculate the intra-cluster distance: the Euclidean distance between the points of a cluster and its centre. This distance should be as small as possible.
- Calculate the inter-cluster distance: the Euclidean distance between the centroids of two clusters. This distance should be as large as possible.
- Then calculate the Dunn index: the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. The larger this ratio, the better the clustering (a small sketch follows this list).
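As a rough sketch of that calculation (with made-up toy data and two hand-picked clusters, purely for illustration):

import numpy as np

# hypothetical toy data: two small clusters in 2-D
clusters = [
    np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8]]),
    np.array([[8.0, 8.0], [8.5, 8.2], [7.8, 9.0]]),
]
centroids = [c.mean(axis=0) for c in clusters]

# intra-cluster distance: how far each cluster's points stray from its centre
intra = [np.linalg.norm(c - centroid, axis=1).max()
         for c, centroid in zip(clusters, centroids)]

# inter-cluster distance: distance between the two centroids
inter = np.linalg.norm(centroids[0] - centroids[1])

# Dunn index: minimum inter-cluster distance over maximum intra-cluster distance
dunn = inter / max(intra)
print(dunn)  # larger means compact, well-separated clusters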
To achieve this aim, we follow these steps:
- Choose the number of clusters, k.
- Randomly select k points of the dataset as the initial centroids.
- Assign every data point to its nearest centroid, forming k clusters.
- Recompute each centroid as the mean of the points assigned to it.
- Repeat the assignment and update steps until the centroids stop moving.
The Dunn index can then be used to compare the quality of the clusterings obtained for different choices of k.
The “points” mentioned above are called MEANS, because they hold the mean values of the items assigned to them. To initialize these means, we have several options. An intuitive method is to initialize the means at random items of the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature x the items have values in [0,5], we initialize the means with values for x in [0,5]). A from-scratch sketch follows below.
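To make this concrete, here is a minimal from-scratch sketch in plain NumPy (initializing the means at randomly chosen items, as described above; kmeans_from_scratch is just an illustrative name):

import numpy as np

def kmeans_from_scratch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize the means at k randomly chosen items of the data set
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assignment step: each point joins the cluster of its nearest mean
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: each mean moves to the average of its cluster's points
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break  # converged: the means stopped moving
        means = new_means
    return means, labels

Calling kmeans_from_scratch(X, k=3) on an (n_samples, n_features) array returns the final means and a cluster label for each item.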
Another way to determine the number of clusters is the Elbow method.
The Elbow Method:
As we increase k, the distortion (the within-cluster sum of squared distances, also called inertia) keeps falling, but at a particular value of k the rate of decrease drops sharply, forming the tip of an elbow in the plot. That value of k is the best-suited choice.
Implementation of K-Means in Python
Important note: K-Means is a distance-based algorithm, so differences of magnitude between features can create a problem. Let's first bring all the variables to the same magnitude.
# standardizing the data (assumes `data` is a DataFrame of numeric features loaded earlier)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# defining the kmeans model with initialization as k-means++
kmeans = KMeans(n_clusters=2, init='k-means++')

# fitting the k-means algorithm on scaled data
kmeans.fit(data_scaled)
# ELBOW PLOTTING
# fitting multiple k-means models and storing the inertia values in an empty list
SSE = []
for cluster in range(1, 20):
    kmeans = KMeans(n_clusters=cluster, init='k-means++')
    kmeans.fit(data_scaled)
    SSE.append(kmeans.inertia_)

# converting the results into a dataframe and plotting them
frame = pd.DataFrame({'Cluster': range(1, 20), 'SSE': SSE})
plt.figure(figsize=(12, 6))
plt.plot(frame['Cluster'], frame['SSE'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

# k-means using 5 clusters and k-means++ initialization
kmeans = KMeans(n_clusters=5, init='k-means++')
kmeans.fit(data_scaled)
pred = kmeans.predict(data_scaled)

frame = pd.DataFrame(data_scaled)
frame['cluster'] = pred
frame['cluster'].value_counts()
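One way to make sense of the fitted clusters (reusing the scaler fitted earlier) is to map the learned centroids back to the original feature scale:

# the centroids live in standardized space; undo the scaling to read them
centres = scaler.inverse_transform(kmeans.cluster_centers_)
print(centres)  # one row per cluster, in the original units of the data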
K-Means Advantages & Disadvantages
Advantages:
- Easy to understand
- Works well on large data-sets
- Easily adapts to new examples
Disadvantages :
- The value of k must be chosen by hand, and a bad choice hurts the results
- Outliers pull the centroids away and distort the clusters
- Struggles as the number of dimensions grows (can be mitigated with PCA; see the sketch below)
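As a rough sketch of that PCA workaround (the choice of two components here is arbitrary, purely for illustration):

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# compress the standardized data to a few components, then cluster
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data_scaled)
kmeans_pca = KMeans(n_clusters=5, init='k-means++')
labels = kmeans_pca.fit_predict(data_reduced)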
Applications of K-Means
- Document Classification
- Customer Segmentation
- Insurance Fraud Detection
- Automatic clustering of IT Alerts
- Market Research
Summary
In this blog, we discussed K-Means, a basic unsupervised machine learning algorithm. I tried to implement it from scratch and explain it in the easiest way. We also saw a few pros, cons and applications of this algorithm in the real world.
I hope this blog post helped you understand K-Means. Comment your thoughts, feedback or suggestions below, and make sure you follow me for similar content. Goodbye, and have a great day!