In machine learning you often work with high-dimensional datasets, that is, datasets with a large number of features. High-dimensional datasets pose a number of problems, the most common being overfitting, which reduces the model's ability to generalize beyond the training set. Dimensionality reduction techniques address this by reducing the number of features in the dataset, and Principal Component Analysis (PCA) is one such technique.

What is Principal Component Analysis?

In machine learning, principal component analysis (PCA) is the most widely used algorithm for dimensionality reduction. It works by identifying the hyperplane that lies closest to the data and then projecting the data onto it. PCA selects the axis that preserves the maximum amount of variance, which is also the axis that minimizes the mean squared distance between the original data and its projections onto that axis.

It first identifies the axis that accounts for the greatest amount of variance in the training data. It then finds a second axis, orthogonal to the first, that accounts for the largest share of the remaining variance, and so on for each subsequent component.

For each principal component, PCA finds a zero-centered unit vector pointing in the direction of that component. Since two opposite vectors lie on the same axis, the directions of the unit vectors returned by PCA are not stable: a pair of unit vectors may flip or rotate, but the plane they define generally remains the same.
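To make this concrete, here is a minimal NumPy sketch, with purely illustrative toy data and variable names, that recovers the principal-component unit vectors from the SVD of the centered data and projects the points onto the first two components:

    import numpy as np

    rng = np.random.default_rng(42)

    # Toy 3D dataset whose variance is mostly concentrated in two directions
    X = rng.normal(size=(200, 3)) * np.array([2.0, 1.0, 0.1])

    # Center the data: PCA assumes zero-centered inputs
    X_centered = X - X.mean(axis=0)

    # The rows of Vt are the zero-centered unit vectors of the principal components
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Project the data onto the plane spanned by the first two components
    W2 = Vt[:2].T             # projection matrix, shape (3, 2)
    X2D = X_centered @ W2     # reduced 2D representation

    # The sign of each unit vector is arbitrary: a different library or a small
    # perturbation of the data may flip it, but the plane it spans stays the same
    print(Vt[:2])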

The key aim of PCA is to reduce the number of variables of a data set, while preserving as much information as possible.

Imagine you are at a concert and want to capture the atmosphere by taking a photo. The scene exists in three dimensions, but a photo can only capture it in two. While this reduction in dimension causes you to lose some detail, you still capture most of the information: for example, the relative size of a person in the photo tells you who is standing in front and who is behind. A 2D image therefore still encodes most of the information that would otherwise only be available in 3D.

Assumptions in PCA

  • There must be linearity in the dataset, i.e. the variables combine in a linear manner to form the dataset and exhibit relationships among themselves.
  • PCA assumes that the principal components with high variance carry the important signal and that components with low variance can be disregarded as noise. PCA has its origins in Karl Pearson's work, where it was first assumed that only the axes with high variance would be turned into principal components.
  • All variables should be measured at the same (ratio) level of measurement and on comparable scales, so standardize the features before applying PCA (see the sketch after this list). A commonly cited guideline is a sample of at least 150 observations, with a ratio of at least 5 observations per feature.
  • The feature set must be correlated; the reduced feature set obtained after applying PCA then represents the original dataset effectively with fewer dimensions.
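Because PCA is sensitive to the scale of each variable, a common way to satisfy the same-scale assumption above is to standardize the features first. A minimal scikit-learn sketch, using the bundled iris dataset purely as an example:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = load_iris(return_X_y=True)

    # Standardize each feature to zero mean and unit variance, then project to 2 components
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
    X_reduced = pipeline.fit_transform(X)

    print(X_reduced.shape)  # (150, 2)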

Types of PCA

  • Regular PCA: Regular PCA is the default version, but it only works if the dataset fits in memory.
  • Incremental PCA: Incremental PCA is useful for large datasets that will not fit in memory, but it is slower than regular PCA, so if the data fits in memory you should prefer regular PCA.
  • Randomized PCA: Randomized PCA is very useful when you want to drastically reduce dimensionality and the dataset fits in memory; in such cases it is faster than regular PCA.
  • Kernel PCA: Kernel PCA is preferred when the dataset is nonlinear. All four variants are available in scikit-learn, as the sketch after this list shows.
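In scikit-learn these variants map onto a handful of classes and parameters. A hedged sketch, with arbitrary example hyperparameters:

    from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

    # Regular PCA: the default, requires the whole dataset in memory
    pca = PCA(n_components=2)

    # Incremental PCA: processes the data in mini-batches, useful for out-of-core learning
    inc_pca = IncrementalPCA(n_components=2, batch_size=200)

    # Randomized PCA: approximate but faster SVD, selected via the svd_solver parameter
    rnd_pca = PCA(n_components=2, svd_solver="randomized")

    # Kernel PCA: nonlinear projections, e.g. with an RBF kernel
    krn_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)

    # Each object exposes the same interface, e.g. X_reduced = pca.fit_transform(X)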

Advantages of using PCA

  • Removes correlated features. PCA transforms correlated features into a set of uncorrelated components, eliminating multi-collinearity. Finding correlated features by hand is time consuming, especially if the number of features is large.
  • Improves machine learning algorithm performance. With the number of features reduced by PCA, the time taken to train your model is significantly reduced (see the pipeline sketch after this list).
  • Reduces overfitting. By removing the unnecessary features in your dataset, PCA helps to overcome overfitting.
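As a sketch of how that speedup is usually realized in practice, the pipeline below first compresses the features with PCA and then trains a classifier; the digits dataset, logistic regression, and 95% variance threshold are illustrative choices, not part of the original text:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_digits(return_X_y=True)

    # Keep just enough components to retain 95% of the variance, then train on the smaller matrix
    clf = make_pipeline(StandardScaler(),
                        PCA(n_components=0.95),
                        LogisticRegression(max_iter=1000))
    clf.fit(X, y)

    print(clf.named_steps["pca"].n_components_)  # number of components actually kept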

Disadvantages of using PCA

  • Independent variables become less interpretable. PCA reduces your features to a smaller number of components, and each component is a linear combination of the original features rather than a feature you can name.
  • Information loss. Some information is inevitably lost, and the loss can be significant if you do not exercise care in choosing the right number of components; the sketch after this list shows one way to check how much variance is retained.
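One common way to keep this loss in check, sketched here under the assumption that X is your (standardized) feature matrix, is to look at the cumulative explained variance ratio and pick the number of components accordingly:

    import numpy as np
    from sklearn.decomposition import PCA

    # Fit PCA with all components kept, assuming X is the feature matrix
    pca = PCA().fit(X)

    # Cumulative fraction of the original variance retained by the first k components
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.argmax(cumulative >= 0.95)) + 1

    print(f"{k} components retain {cumulative[k - 1]:.1%} of the variance")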

In summary

Companies and organizations use dimensionality reduction methods such as PCA to condense a large dataset into one that is more manageable and easier to use.
The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.
