A lot has been said about Principal Component Analysis (PCA) and its application to dimensionality reduction in large datasets, but I have found that the concept covers far more than the screen time it gets: it is more than just a dimensionality reduction tool. Simply put, PCA is a statistical procedure that transforms a set of observations of possibly correlated variables into a new set of uncorrelated variables called principal components. After the transformation, these components capture the most significant information or patterns present in the original data.
In PCA, the transformation is constructed so that the first principal component accounts for the largest possible variance, the second principal component accounts for the largest possible share of the remaining variance (while being orthogonal to the first), and so on. By retaining a subset of these principal components, the dimensionality of the data is reduced while its most important features are preserved.
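To make this concrete, here is a minimal sketch (using NumPy and scikit-learn on synthetic, randomly generated data, so the shapes and numbers are purely illustrative) of reducing a correlated five-feature dataset to two principal components and checking how much variance each one explains:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 observations of 5 correlated features, built from 2 latent factors
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # shape (200, 2)

# Fraction of the total variance captured by each retained component
print(pca.explained_variance_ratio_)
```

Because the five features are driven by only two latent factors, the first two components should account for nearly all of the variance.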
SVD (Singular Value Decomposition), on the other hand, is a matrix factorization technique that decomposes a matrix into three separate matrices that are easier to manipulate and analyze. It lays the foundation for untangling data into independent components.
Suppose we have a rectangular matrix A, which can be factored as:
A = U Σ V^T
where U is the left-singular vectors matrix. Its columns provide an orthonormal basis for the column space of the original matrix.
Σ is the singular values matrix. It is a diagonal matrix whose non-negative entries, the singular values, are arranged in decreasing order and indicate the importance or significance of each component.
V^T is the transposed right-singular vectors matrix. The columns of V provide an orthonormal basis for the row space of the original matrix.
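As a quick numerical sanity check (on an arbitrary random matrix, so the values themselves carry no meaning), NumPy's np.linalg.svd returns exactly these three pieces, and multiplying them back together recovers A:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))                      # a rectangular matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False) # economy-size SVD
Sigma = np.diag(s)                               # singular values on the diagonal

print(U.shape, Sigma.shape, Vt.shape)            # (6, 4) (4, 4) (4, 4)
print(np.allclose(A, U @ Sigma @ Vt))            # True: the product recovers A
print(np.all(s >= 0), np.all(np.diff(s) <= 0))   # non-negative and descending
```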
Let us multiply A transpose by A; we get:
A^T A = (U Σ V^T)^T (U Σ V^T) = V Σ^T U^T U Σ V^T = V Σ^2 V^T
Note that U^T U is the identity, so we are left with V Σ^2 V^T: the columns of V are the eigenvectors of A^T A, and the squared singular values in Σ^2 are its (non-zero) eigenvalues.
Also note that the matrix A^T A is square, symmetric, and positive semi-definite (it is positive definite whenever A has full column rank).
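These claims are easy to verify numerically. The sketch below (again on an arbitrary random matrix) compares the eigendecomposition of A^T A with the SVD of A: the eigenvalues should equal the squared singular values, and the eigenvectors should match the columns of V up to a sign flip per column:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(A.T @ A)       # eigh: A^T A is symmetric

# eigh returns eigenvalues in ascending order; reverse to match the SVD ordering
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(np.allclose(eigvals, s**2))                  # True
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))  # True (up to sign)
```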
How are these two concepts related?
The relationship between the two concepts rests mainly on the fact that PCA can be derived from the SVD framework. This interconnectedness is visible in the components of SVD, as outlined below:
U (left-singular vectors): The columns of U are not the principal directions themselves; rather, U scaled by Σ gives the coordinates of each observation along the principal components (the principal component scores), i.e. the projection of the data onto the new axes.
Σ (singular values): The singular values in Σ are directly related to the eigenvalues obtained from PCA. Specifically, the squared singular values (after dividing by n − 1, where n is the number of observations) equal the eigenvalues of the covariance matrix of the centered data.
V^T (right-singular vectors): The columns of V are the eigenvectors of the covariance matrix associated with PCA (the covariance matrix is real and symmetric, hence Hermitian, so these eigenvectors can be chosen orthonormal). They form a set of orthogonal axes that capture the directions of maximum variance in the data.
The principal components obtained through PCA capture the directions of maximum variance in the data. These principal components are exactly the right-singular vectors obtained from the SVD of the centered data matrix. By selecting the top-k principal components (those corresponding to the largest eigenvalues in PCA, or equivalently the largest singular values in SVD), we can retain the most significant information while reducing the dimensionality of the dataset.
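The equivalence can be checked on synthetic data: the sketch below computes the principal directions once from the eigendecomposition of the covariance matrix and once from the SVD of the centered data matrix, then projects onto the top-k components (the data, the seed, and k = 2 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated features
Xc = X - X.mean(axis=0)                    # center each feature
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # descending order

# Route 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(eigvals, s**2 / (n - 1)))               # eigenvalues match
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))         # directions match up to sign

# Keep the top-k components and project the data onto them
k = 2
scores = Xc @ Vt[:k].T                     # equivalently U[:, :k] * s[:k]
print(scores.shape)                        # (100, 2)
```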
In practice, PCA is often carried out through the steps of SVD. Before applying PCA, the data is typically centered (and often standardized) so that each feature has zero mean and unit variance. The covariance matrix of the centered data is then computed. PCA proceeds by decomposing this covariance matrix into its eigenvectors and eigenvalues, which correspond to the right-singular vectors and squared singular values of the SVD. These values indicate how much variance each principal component explains: the larger the eigenvalue, the more of the data's variance is captured by the corresponding principal component.
This enables a crucial aspect of PCA: one can choose to retain only the subset of principal components that explains a significant portion (e.g. 90%) of the total variance. By doing so, the dimensionality of the data can be effectively reduced while still preserving a considerable amount of information. It is pertinent to note that PCA captures only linear relationships between features, so it is advisable to apply suitable transformations to strongly non-linear features before applying PCA.
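Putting the procedure together, here is a rough from-scratch sketch (on synthetic, standardized data, with the 90% threshold borrowed from the example above) that picks the smallest number of components whose cumulative explained variance reaches the target and projects the data onto them:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

# Center and standardize: zero mean and unit variance per feature
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s**2 / np.sum(s**2)            # variance ratio per component
cumulative = np.cumsum(explained)

# Smallest k whose cumulative explained variance reaches 90%
k = int(np.searchsorted(cumulative, 0.90)) + 1
X_reduced = Xs @ Vt[:k].T                  # projection onto the top-k components

print(k, round(cumulative[k - 1], 3), X_reduced.shape)
```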
In summary, the two concepts are intricately connected. SVD provides the mathematical framework for PCA: dimensionality reduction comes down to keeping the most significant singular values, which correspond to the variance explained by the principal components. Understanding this relationship helps in applying Principal Component Analysis effectively.