The Azimuth Project
Principal component analysis (Rev #4)


Idea

According to Wikipedia:

Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has as high a variance as possible (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (uncorrelated with) the preceding components. Principal components are guaranteed to be independent only if the data set is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables.

Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT), the Hotelling transform or proper orthogonal decomposition (POD).

PCA was invented in 1901 by Karl Pearson. Now it is mostly used as a tool in exploratory data analysis and for making predictive models. PCA can be done by eigenvalue decomposition of a data covariance matrix or singular value decomposition of a data matrix, usually after mean centering the data for each attribute. The results of a PCA are usually discussed in terms of component scores (the transformed variable values corresponding to a particular case in the data) and loadings (the weight by which each standardized original variable should be multiplied to get the component score) (Shaw, 2003).

PCA with $n$ basis vectors can also be viewed as the selection of an $n$-dimensional subspace $S$ such that the projection of the dataset vectors onto $S$ has the minimal summed $L_2$ error relative to the original dataset.
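To make the two computational routes mentioned above concrete, here is a minimal NumPy sketch (not part of the original page) that obtains the same principal components from an eigenvalue decomposition of the covariance matrix and from a singular value decomposition of the mean-centred data matrix; the synthetic dataset and its dimensions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic dataset: 200 observations of 3 correlated variables (illustrative only).
Y = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.0],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 0.2]])

# Mean-centre each attribute, as described above.
Yc = Y - Y.mean(axis=0)

# Route 1: eigenvalue decomposition of the data covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Yc, rowvar=False))
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: singular value decomposition of the centred data matrix.
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)

# Both routes give the same principal directions (up to sign), and the
# squared singular values scaled by 1/(K - 1) equal the eigenvalues.
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))
print(np.allclose(s**2 / (len(Yc) - 1), eigvals))

scores   = Yc @ eigvecs    # component scores for each observation
loadings = eigvecs         # loadings: weights applied to the original variables
```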

Centering

The orthogonal transformation used by PCA implicitly assumes that the dataset is generated by a process with a zero “expected value”, so a “mean vector” is generally subtracted from all the data points. There are two approaches to achieving this:

  1. Using analytical problem knowledge to determine the mean, e.g., knowing that the mean is actually zero.

  2. Using the computed average of the data set as the “mean vector”.

The reconstruction viewpoint is relatively insensitive to the choice of mean, but techniques which use the PCA vectors as random variates can be significantly affected by the mean vector.
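As a short sketch of the two centering choices (again not from the original page; the dataset and sizes are invented), both approaches feed the same SVD-based routine and differ only in which mean vector is subtracted:

```python
import numpy as np

def pca_basis(data, mean_vector):
    """Principal directions of `data` after subtracting the given mean vector."""
    _, _, Vt = np.linalg.svd(data - mean_vector, full_matrices=False)
    return Vt

rng = np.random.default_rng(1)
data = rng.normal(loc=0.05, scale=1.0, size=(500, 4))   # true mean is slightly off zero

# Approach 1: analytical knowledge asserts that the mean is zero.
basis_analytic = pca_basis(data, np.zeros(4))

# Approach 2: use the computed average of the dataset as the mean vector.
basis_empirical = pca_basis(data, data.mean(axis=0))

# For reconstruction the two bases behave similarly; if the components are
# later treated as random variates, the choice of mean matters more.
```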

Basic mathematical formulation

Given an $m$-dimensional dataset $\{y^{i}\}_{i=0}^K$, the $n$-dimensional PCA is a set of $n$-dimensional vectors $\{\lambda^{i}\}_{i=0}^K$ which can be used to approximate the original $y$s using

$$ y^i = \mu + \sum_{j=1}^n \lambda^i_j \mathbf{v}^j = \sum_{j=0}^n \lambda^i_j \mathbf{v}^j $$

where the second expression absorbs the mean by taking $\lambda^i_0$ to be identically $1$ and $\mathbf{v}^0 = \mu$. The crucial property is that, other than the mean vector, all the $\mathbf{v}$s are orthonormal, i.e., $\mathbf{v}^i \cdot \mathbf{v}^j = \delta_{i j}$. In addition to optimising the information content for a given number of vectors, this orthogonality gives rise to other useful calculational properties of PCA.
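As a check on this formulation, the following sketch (not from the original page; the data and dimensions are invented) builds the $\mathbf{v}^j$ from an SVD of the centred data, verifies the orthonormality condition, and reconstructs the dataset from $n$ components:

```python
import numpy as np

rng = np.random.default_rng(2)
m, K, n = 5, 300, 2                          # ambient dimension, sample count, PCA dimension
Y = rng.normal(size=(K, m)) @ rng.normal(size=(m, m)) + rng.normal(size=m)

mu = Y.mean(axis=0)                          # plays the role of v^0
_, _, Vt = np.linalg.svd(Y - mu, full_matrices=False)
V = Vt[:n]                                   # rows are v^1, ..., v^n

# Orthonormality: v^i . v^j = delta_ij
print(np.allclose(V @ V.T, np.eye(n)))

# Coefficients lambda^i_j and the rank-n approximation y^i ~ mu + sum_j lambda^i_j v^j
lam   = (Y - mu) @ V.T                       # shape (K, n)
Y_hat = mu + lam @ V
print(np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y - mu))   # relative summed L_2 error
```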

References