A short summary of PCA (Principal Component Analysis)

Application

PCA is the most common dimensionality reduction technique. It is primarily used to compress data to lower dimensions for various reasons including:

  • Speeding up of algorithm

  • Visualization

Caveat

Superficially it may seem that we are eliminating features by applying PCA. But that is not the case. PCA compresses a vector from a higher dimension to a lower dimension. Say for example, an n*1 feature vector to a k*1 feature vector, where n > k. So, we are not altogether deleting some features rather trying to compress all the features to a lesser dimension with minimal data loss. That is why PCA should not be used to address overfitting, rather regularization should be used, preferably L2 regularization also known as ridge regression. And for feature selection-that is to eliminate some features- L1 regularization may be used which is also known as Lasso.

So now on to what is under the hood in PCA

We will not delve into the mathematical workings of the algorithm as it is out of the scope of this article. We are going to go through the algorithm and talk about application related issues only.

Algorithm

Preprocessing

  • Perform mean normalization

    Replace each with , so that the mean is rescaled to 0. is the mean of the column j.

  • Perform feature scaling

    If the features have very differents scales then we need to do feature scaling using this equation:

    where is some measure of the range. It can be:

    • Max - Min
    • Standard Deviation (more commonly used)

Computing the Covariance Matrix

  • Compute covariance matrix,

Remember, will be a n*n matrix as would be n * 1 vector. This can be computed in python with one line of code for the whole feature matrix using numpy’s matrix implementation libraries.

  sigma = (1 / m )* (X' * X)

where X is the feature matrix.

Compute the Eigen vector

  • Compute Eigen vector using SVD(Single Value Decomposition). We can compute the Eigen vector of the covariance matrix sigma in python using numpy or scipy’s svd function.
    (U,S,V) = linalg.svd(sigma)

Get the reduced dimension feature vectors

  • Take k (k is the dimension we are reducing to) columns of the U matrix, call this U_reduce.
      U_reduce = U(:,1:k)
    

    So U_reduce will be an n * k matrix.

  • Compute z with the following equation:

is a k * n matrix, x is a n *1 vector, so z will be a k*1 vector. We can do this for all x in one go with the following code:

      Z = U_reduce' * X

Deciding on k

How to determine which value of k to use? That is, to what dimension should we reduce our vectors to? For visualization purposes we usually use k=2 or 3 so that we can plot the data on 2D or 3D. But for other use cases like speeding up algorithm we need to know the minimum value of k that we can get to without losing out on a lot of information.

  • PCA tries to minimize the average squared projection error.

  • Total variation in data can be measured as the average over data saying how far the training examples are from the origin.

The ratio of these two metrics can help us choose the value of k.

Smaller ratio (<0.01 that is 1%) indicates that approximately 99% of the variance of the original data is retained. So, the information loss due to dimensionality reduction is minimal.The ratio will be small if the average squared projection error is small, that is, the projection of the data points is very similar to their higher dimensional counterparts.

So to sum up the whole thing:

  • Do preprocessing (mean normalization and feature scaling)

  • Compute PCA starting with k=1 or any desired value

  • Compute (or , and ‘s

  • Check if
  • Repeat with an incremented k until the above equation is satisfied.

Reconstruction

We can reverse the procedure to reconstruct an approximation of the higher dimensional representation. We computed z to be:

So, to reconstruct x we can do:


U_reduce is an n * k matrix, z is a k * 1 vector. So, will be an n * 1 vector.

PCA can be used in this way to help with visualization for higher dimensional data ( by using k =2 or 3) which can help us understand the data better. It can also be used to bring down the number of features to speed up any ML algorithm. Typically data dimensionality can be reduced 5-10X without causing any significant consequence to the accuracy of any ML algorithm.