Principal Component Analysis
Hello! In this blog, we will be looking at one of the most important dimensionality reduction techniques: Principal Component Analysis (PCA).
Using PCA, we can find the correlation between variables, such as whether summer affects ice-cream sales, and by how much. In PCA we generate a covariance matrix to check these relationships, but let's start from scratch.
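For intuition, here is a tiny sketch with made-up temperature and ice-cream sales numbers; a correlation near +1 means the two rise together:

# Tiny sketch with hypothetical numbers: how strongly do temperature and sales move together?
import numpy as np

temperature = np.array([18, 22, 25, 29, 31, 34])         # degrees Celsius (hypothetical)
sales = np.array([120, 150, 180, 240, 260, 300])          # units sold (hypothetical)

correlation = np.corrcoef(temperature, sales)[0, 1]
print(f"Correlation between temperature and sales: {correlation:.2f}")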
As we said earlier, PCA is a dimensionality reduction technique, so let's first take a look at how dimensions are reduced.
But why do we need to reduce dimensions?
PCA tries to remove a curse that haunts many ML projects: OVERFITTING. Overfitting happens when the model fits the training data too closely, effectively memorizing all the points in the training dataset, so it generalizes poorly to new data. To reduce this overfitting, we generate the principal components.
Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. So, the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second and so on.
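As a minimal sketch of this ordering (using scikit-learn on synthetic 10-dimensional data, not a real dataset), notice how each successive component explains less variance than the one before it:

# Minimal sketch: fit PCA on synthetic 10-dimensional data and observe that the
# explained variance drops from the first component to the last.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 samples, 10 features (synthetic data)

pca = PCA(n_components=10)            # 10 features give at most 10 components
pca.fit(X)
print(pca.explained_variance_ratio_)  # sorted from largest to smallest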
Here, in the image below, I have plotted the best-fit (overfitted) model with the data points, given two attributes X & Y. We will generate the principal components by viewing the model from different directions.
PC1 — First principal component (generated from view 1)
PC2 — Second principal component (generated from view 2)
As you can see in the image above, we reduced the 2-dimensional model to 1 dimension by generating its principal components from different views. Note that the number of principal components generated is at most the number of original attributes. Also remember that the components must be orthogonal to each other, i.e. uncorrelated, so each component carries information the others do not.
The main idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables correlated with each other, either heavily or lightly, while retaining as much of the variation present in the dataset as possible.
Now, to actually compute the principal components, we first find the covariance matrix. This step aims to understand how the variables of the input dataset vary from the mean with respect to each other, or in other words, to see whether there is any relationship between them. Sometimes variables are highly correlated in such a way that they contain redundant information, so to identify these correlations we compute the covariance matrix. For 2 attributes a 2x2 matrix is generated, for 3 attributes a 3x3 matrix, and so on (a small NumPy sketch follows the list below).
What do the values of the covariance matrix tell us?
It’s the sign of the covariance that matters :
- if positive: the two variables increase or decrease together (correlated)
- if negative: one increases when the other decreases (inversely correlated)
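Here is a minimal sketch of building a 2x2 covariance matrix with NumPy, reusing the made-up temperature and sales numbers from earlier:

# Minimal sketch: a 2x2 covariance matrix for two hypothetical variables.
import numpy as np

temperature = np.array([18, 22, 25, 29, 31, 34])          # hypothetical values
sales = np.array([120, 150, 180, 240, 260, 300])           # hypothetical values

# np.cov treats each argument as one variable, so it returns a 2x2 matrix here
cov_matrix = np.cov(temperature, sales)
print(cov_matrix)
# The off-diagonal entry is positive, so the two variables rise together.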
The Next Step is to compute the EIGENVECTORS and EIGENVALUES of the COVARIANCE MATRIX to identify the PRINCIPAL COMPONENTS.
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix to determine the principal components of the data. Let me explain with the image below:
So, here we first found the two eigenvalues by taking the covariance matrix from before and solving the characteristic equation det(C - lambda*I) = 0, which for a 2x2 matrix is a quadratic equation in lambda.
Then we substituted each of these values (lambda1, lambda2) back into (C - lambda*I)v = 0 in place of lambda and solved for the eigenvectors (X1, Y1) and (X2, Y2). These eigenvectors are the principal components. But what is their priority?
The priority of the eigenvectors as principal components is decided by the values of lambda: each eigenvalue measures how much variance is carried along its eigenvector, so the greater the value of lambda, the higher that eigenvector ranks in the list of principal components.
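Continuing the same hypothetical two-variable example, here is a minimal sketch of computing the eigenvalues and eigenvectors of the covariance matrix with NumPy and ranking them:

# Minimal sketch: eigen-decomposition of a covariance matrix, with components
# ranked by their eigenvalues (largest eigenvalue = first principal component).
import numpy as np

temperature = np.array([18, 22, 25, 29, 31, 34])          # hypothetical values
sales = np.array([120, 150, 180, 240, 260, 300])           # hypothetical values
cov_matrix = np.cov(temperature, sales)

# eigh is meant for symmetric matrices such as a covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Rank components by decreasing eigenvalue; each column is one eigenvector
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("Eigenvalues (variance along each component):", eigenvalues)
print("First principal component:", eigenvectors[:, 0])
# Orthogonality check: the dot product of the two components is (numerically) zero
print("PC1 . PC2 =", eigenvectors[:, 0] @ eigenvectors[:, 1])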
Implementation of Principal Component Analysis in Python
# Implementation of PCA is very easy in Python. It is done before training the
# model, to reduce the dimensionality of the training and testing datasets.
# Note: scaling the data with StandardScaler() is a must before applying PCA.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the features so that each has zero mean and unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Apply the PCA function on the training set, then project the test set onto the same components
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# Fraction of the total variance explained by each principal component
explained_variance = pca.explained_variance_ratio_
Problems that can arise from using PCA
Information Loss:
Although Principal Components try to cover maximum variance among the features in a dataset, if we don’t select the number of Principal Components with care, it may miss some information as compared to the original list of features.
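A simple safeguard is to look at the cumulative explained variance and keep only as many components as needed to cover, say, 95% of it. Below is a minimal sketch on synthetic data; the 95% threshold is just an illustrative choice:

# Minimal sketch: pick the number of components that retains ~95% of the variance
# instead of hard-coding n_components. The data here is synthetic.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                # synthetic stand-in for a real dataset
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)                     # keep every component for inspection
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% of the variance:", n_components)

# scikit-learn can also do this directly by passing a fraction: PCA(n_components=0.95)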
Independent variables become less interpretable:
After implementing PCA on the dataset, your original features will turn into Principal Components. Principal Components are the linear combination of your original features. Principal Components are not as readable and interpretable as original features.
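To see why, here is a minimal sketch on synthetic data (the feature names are hypothetical) that prints the weights each principal component assigns to the original features:

# Minimal sketch: each principal component is a weighted mix of all original
# features, which is why it no longer maps to a single readable column.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

feature_names = ["temperature", "humidity", "footfall", "price"]   # hypothetical
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                  # synthetic stand-in for real data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
for i, weights in enumerate(pca.components_, start=1):
    mix = " + ".join(f"{w:.2f}*{name}" for w, name in zip(weights, feature_names))
    print(f"PC{i} = {mix}")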
Summary
Thus, in this blog, we learned to treat overfitting by reducing dimensionality using PCA. We saw the inner algorithm: computing the covariance matrix, its eigenvalues and eigenvectors, and from them the principal components. Then we implemented it in Python. Finally, we saw the problems that might arise from using PCA.