Principal Component Analysis (PCA)

👋 Hello everyone


👀 In this blog post we are going to understand what dimensionality reduction is and how PCA helps us achieve it. So let's get started 🏄‍♂️

Data has been growing across platforms for a long time, but interpreting it can be difficult. Making sense of such data often requires statistical techniques that reduce its dimensionality while preserving as much of the useful information as possible.

Reducing the number of available dimensions (features) in a dataset is called “Dimensionality Reduction”.

There are two broad approaches to reducing dimensionality: feature selection, where we remove features entirely, and feature extraction, where we combine the existing features into a smaller set of new ones.
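To make the distinction concrete, here is a minimal sketch with made-up numbers: feature selection keeps a subset of the original columns, while feature extraction (for example via PCA) builds new columns from all of them.

import numpy as np
from sklearn.decomposition import PCA

# A tiny 4-sample, 3-feature matrix (made-up numbers).
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.1, 5.9],
              [3.0, 6.2, 9.1],
              [4.0, 7.9, 12.2]])

# Feature selection: keep a subset of the original columns.
X_selected = X[:, [0, 2]]  # drop the middle feature entirely

# Feature extraction: build new features from combinations of all columns.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (4, 2) (4, 2)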

What is PCA?🤔

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is generally used to reduce the size of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Let's understand this with an example.

Nimisha is a realtor (real estate agent) who's having trouble selling some of the property her company manages. She wonders why some houses have sat unsold for as long as nine months, but spotting patterns in a dataset that's big and has lots of attributes, generally called "features," can be challenging to do by hand.

In Nimisha’s dataset, there are mainly three features:

  • Total number of doors in a house,

  • Total number of bedrooms,

  • Total number of bathrooms.

Knowing the total number of doors in a house doesn't feel like it would necessarily reveal much about the data, since we have other characteristics to consider, like heating type or square footage. Putting it statistically, the total number of doors in a house isn't critical, since we have the other two features to examine.

Ok, no big deal! If the total number of doors in a house is a redundant feature, we can remove it from the dataset. This makes sense in our case, but in a dataset with hundreds or even thousands of features, we need to be more cautious.
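One quick way to spot such redundancy in practice is to look at pairwise correlations. Here is a minimal sketch, assuming a small, hypothetical version of Nimisha's dataset:

import pandas as pd

# Hypothetical values for Nimisha's three features.
houses = pd.DataFrame({
    "doors": [8, 10, 12, 9, 14],
    "bedrooms": [2, 3, 4, 2, 5],
    "bathrooms": [1, 2, 3, 1, 3],
})

# Correlations close to 1 suggest that "doors" adds little
# information beyond "bedrooms" and "bathrooms".
print(houses.corr())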

Why is there a need for PCA?🧐

Principal component analysis (PCA) is a statistical technique used in pattern recognition, exploratory data analysis, and multivariate statistics. It is a data-reduction technique: it describes the variability among the variables in a dataset using a small number of components. It does this by transforming a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components.

This transformation accomplishes two things. First, it reduces the number of variables on which further processing must be performed. Second, it reveals redundancy or hidden correlations among the original variables. The result is that many conclusions can be drawn from a much smaller set of variables than the original collection; it reduces complexity while retaining the information related to variability and correlation.

PCA is a dimensionality reduction technique often used in data analysis; it reduces the dimensionality of a data set by identifying a small set of new variables (the principal components) that explain most of the variance in the original set.
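To make the "linearly uncorrelated" part concrete, here is a small sketch on made-up data: the two original variables are strongly correlated, while the correlation between the resulting principal components is essentially zero.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=500)
# Two variables where the second is largely a copy of the first plus noise.
X = np.column_stack([a, 0.8 * a + rng.normal(scale=0.3, size=500)])

print(np.corrcoef(X, rowvar=False)[0, 1])   # strongly correlated originals (~0.9)

Z = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(Z, rowvar=False)[0, 1])   # principal components: ~0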

Advantages of PCA

  • The principal components are orthogonal, so there is no redundancy among them (see the sketch after this list).

  • PCA reduces the dimensionality of your dataset by identifying latent correlations and relationships between variables.

  • It is beneficial to ML algorithms, as it makes high-dimensional datasets easier to visualize and faster to process by projecting them onto a few components.

  • Since the components are chosen by the maximum-variance criterion, the scope of our evaluation can be narrowed: we can focus on the few directions that capture most of the variation instead of inspecting every original variable.

  • PCA is one of the most effective ways to overcome issues that data scientists often face, especially over-fitting. It helps decrease the number of features/variables in a dataset and can improve generalization.
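As a quick check of the first and last points, the sketch below (again on made-up data with only two underlying directions of variation) shows that the component vectors are mutually orthogonal and that the explained variance ratio tells us how much information each component retains.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 2))
# Five observed variables that are noisy mixtures of just two hidden factors.
X = base @ rng.normal(size=(2, 5)) + rng.normal(scale=0.1, size=(300, 5))

pca = PCA(n_components=3).fit(X)

# Orthogonality: components_ @ components_.T is (approximately) the identity matrix.
print(np.round(pca.components_ @ pca.components_.T, 3))

# Share of the total variance captured by each component; the first two dominate
# because the data only has two real directions of variation.
print(np.round(pca.explained_variance_ratio_, 3))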

Implementation using Python🛠

1. Importing the libraries:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score

  • Pandas: For data analysis and manipulation.

  • NumPy: For computing complex mathematical functions.

  • Seaborn: For data visualization.

  • Matplotlib: For plotting data, and creating static, animated, and interactive visualizations.

  • Sklearn: For machine learning and statistical modeling.

2. Generating data using make_regression

In make_regression, the output is generated by applying a (potentially biased) random linear regression model with n_informative nonzero regressors to randomly generated input features, plus centered Gaussian noise with an adjustable scale.

x, y = make_regression(n_features=5, n_samples=100, n_informative=3, n_targets=1)

data = {'col_1': x[:, 0], 'col_2': x[:, 1], 'col_3': x[:, 2], 'col_4': x[:, 3], 'col_5': x[:, 4], 'col_v': y}

df = pd.DataFrame(data)

print(df)

#%%
features = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']
target = 'col_v'

x = df[features]
y = df[target]

(Output: the generated DataFrame with columns col_1–col_5 and the target col_v.)

3. Plotting a pair plot

sns.set_style('whitegrid')
sns.pairplot(df, hue='col_v', diag_kind='kde',
             plot_kws={'edgecolor': 'k'}, height=4)

plt.show()

Output

(Pair plot of the five feature columns, colored by col_v.)

4. Train-test-split:

train_test_split splits the input data into random train and test subsets in a single call (optionally subsampling the data). Its main parameters are:

  • test_size: the proportion of the dataset to include in the test split; a value between 0 and 1 (default 0.25 when train_size is not set).

  • train_size: the proportion of the dataset to include in the train split; a value between 0 and 1 (by default, the complement of test_size, i.e. 0.75).

  • random_state: controls the shuffling applied to the data before the split, making it reproducible.

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=12)

5. Applying linear regression without dimensionality reduction:

  • Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. More input features often make a predictive modeling task harder, a problem generally referred to as the curse of dimensionality.

lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
y_liner_pred = lin_reg.predict(x_test)
r2_linear = r2_score(y_test, y_liner_pred)
  • R2 score: measures how well the model's predictions approximate the actual target values (1.0 means a perfect fit).

6. Applying dimensionality reduction from 5 dimensions to 3 dimensions using PCA:

# Reduce the five features to three principal components.
pca = PCA(n_components=3)
x_pca = pca.fit_transform(x)
#%%
# Note: for simplicity, PCA is fitted on the full dataset before splitting.
x_train_pca, x_test_pca, y_train_pca, y_test_pca = train_test_split(x_pca, y, random_state=12)

#%%
lin_reg_pca = LinearRegression()
lin_reg_pca.fit(x_train_pca, y_train_pca)
#%%
y_liner_pred_pca = lin_reg_pca.predict(x_test_pca)

r2_linear_pca = r2_score(y_test_pca, y_liner_pred_pca)
#%%
print("R2 score for linear regression:", r2_linear)
print("R2 score for linear regression with pca:", r2_linear_pca)

Output

(Output: the printed R2 scores, with and without PCA.)
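As a closing note: instead of hard-coding n_components=3, scikit-learn also lets you pass a fraction to PCA so that it keeps however many components are needed to retain that share of the variance. A minimal sketch, continuing from the x defined earlier in the post:

# Keep as many components as needed to retain 95% of the variance.
pca_95 = PCA(n_components=0.95)
x_pca_95 = pca_95.fit_transform(x)

print("Components kept:", pca_95.n_components_)
print("Variance retained per component:", pca_95.explained_variance_ratio_)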