I recently ran a data science training course on the topic of principal component analysis and dimension reduction. This course was less about the intimate mathematical details and more about understanding the various outputs that are available when running PCA. In other words, my goal was to make sure that anyone following this tutorial can see what terms like “explained_variance_”, “explained_variance_ratio_”, and “components_” mean when they probe the PCA object. It shouldn’t be a mystery, and it should be something that anyone can recreate “by hand”.
My training sessions tend to be fluid and no one session is the same as any other. Thus, this write-up is more of a “paraphrasing” than an exact recounting of events.
In this session, I created a four-dimensional data set with random pairwise covariances. Then I showed how to de-correlate this data with PCA and how eigenvectors and eigenvalues relate to the typical vocabulary that is encountered when reporting output of PCA.
If you or your company are interested in a hands-on session or course, send me an email or leave a comment below.
import math
import random
import itertools
import pandas
import numpy
import scipy
import scipy.stats
import scipy.linalg
import sklearn
import sklearn.linear_model
import sklearn.decomposition
import matplotlib
import matplotlib.pyplot as plt
generate a matrix of random numbers and a diagonal matrix of variances¶
size = 4
randmatrix = numpy.matrix(numpy.random.random((size,size))*2-1)
sigmas = numpy.matrix(numpy.identity(size)*[3,10,.4,.09])
generate a random covariance matrix¶
z = numpy.matrix(scipy.linalg.orth(randmatrix))
cov = z.T*sigmas*z
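As a quick sanity check (my own addition here, not something we ran in the session): since \(z\) is orthogonal, \(\texttt{cov} = z^{T}\Sigma z\) should come out symmetric with eigenvalues equal to the diagonal of \(\texttt{sigmas}\).
# cov should be symmetric, and its eigenvalues should be the diagonal of sigmas (0.09, 0.4, 3, 10)
numpy.allclose(cov, cov.T), sorted(numpy.linalg.eigvalsh(cov))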
compute correlation matrix¶
temp = numpy.matrix(numpy.identity(size)*numpy.diagonal(cov)**-0.5)
corr = temp*cov*temp
corr
generate multivariate normal random numbers with specified means and covariance matrix¶
means = [4,20,3,8]
rvs = numpy.random.multivariate_normal(means,cov.tolist(),1000)
eyeball empirical correlation matrix against theoretical from above¶
empirical_corr = numpy.corrcoef(rvs.T)
empirical_corr
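If eyeballing feels unsatisfying, one extra line (my addition) quantifies the agreement; with 1000 samples the largest discrepancy should be on the order of a few hundredths.
# largest absolute difference between empirical and theoretical correlations
numpy.abs(empirical_corr - numpy.asarray(corr)).max()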
plot all pairwise relationships and eyeball behavior against correlation matrix¶
fig,ax = plt.subplots(nrows = 3, ncols = 2, figsize = (6,9), gridspec_kw = {'height_ratios':[1]*3,'width_ratios':[1]*2})
fig.subplots_adjust(wspace = 0.4, hspace = 0.4)
for i,(a,b) in enumerate(itertools.combinations(range(size),2)):
    row = i//2
    col = i%2
    ax[row][col].scatter(rvs[:,a],rvs[:,b],marker = '.')
    ax[row][col].set_xlabel('dim {a}'.format(a = a))
    ax[row][col].set_ylabel('dim {b}'.format(b = b))
    ax[row][col].set_title('corr: {corr:.3f}'.format(corr = empirical_corr[a][b]))
sklearn PCA, fit and transform data¶
pca = sklearn.decomposition.PCA(n_components = size)
pca.fit(rvs)
transformedrvs = pca.transform(rvs)
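In the spirit of recreating things by hand: with the default whiten=False, pca.transform is just centering followed by projection onto the components. The sketch below (the variable name manual is mine) verifies that.
# subtract the fitted mean, then project onto the eigenvector directions
manual = (rvs - pca.mean_).dot(pca.components_.T)
numpy.allclose(manual, transformedrvs)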
pca.explained_variance_ = eigenvalues of covariance matrix¶
Note the use of \(\texttt{ddof = 0}\). For reasons I haven’t yet been able to fully pin down (it appears to depend on the scikit-learn version), \(\texttt{pca.explained_variance}\) coincides with the eigenvalues computed with \(\texttt{ddof = 0}\) in some cases and with \(\texttt{ddof = 1}\) (which applies the sampling penalty) in others. Both are given here for the reader’s benefit.
# scipy.linalg.eig returns (complex-typed) eigenvalues in no particular order,
# whereas pca.explained_variance_ is sorted in descending order
eigenvalues, eigenvectors = scipy.linalg.eig(numpy.cov(rvs.T,ddof=0))
eigenvalues, pca.explained_variance_, scipy.linalg.eig(numpy.cov(rvs.T,ddof=1))[0]
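Whichever convention a given scikit-learn version uses, the two sets of eigenvalues differ only by the factor \(\frac{n}{n-1}\), which is easy to verify by hand (the variable \(n\) below is mine):
n = rvs.shape[0]
# eigenvalues computed with ddof=0 scale to the ddof=1 versions by n/(n-1)
numpy.allclose(sorted(eigenvalues.real*n/(n-1)), sorted(scipy.linalg.eig(numpy.cov(rvs.T,ddof=1))[0].real))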
observed covariance matrix of transformed data is diagonal matrix of eigenvalues¶
numpy.cov(transformedrvs.T,ddof=0), numpy.diagonal(numpy.cov(transformedrvs.T,ddof=0)), eigenvalues, pca.explained_variance_
pca.components_ = eigenvectors of covariance matrix (up to sign change)¶
eigenvectors.T, pca.components_
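Because each eigenvector is only determined up to sign, and scipy.linalg.eig does not promise the same ordering as PCA (which sorts by descending eigenvalue), a direct element-wise comparison can fail even when the two agree. One way to check, sketched here as my own addition, is to reorder by eigenvalue and compare magnitudes:
# align the eigenvector columns with pca.components_ by sorting on descending eigenvalue,
# then compare absolute values so that sign flips don't matter
order = numpy.argsort(eigenvalues.real)[::-1]
numpy.allclose(numpy.abs(eigenvectors[:,order].T), numpy.abs(pca.components_))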
verify eigenvector matrix is orthogonal \((Q^{T}Q = I)\)¶
numpy.matrix(eigenvectors.T)*numpy.matrix(eigenvectors)
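The raw product carries a little floating-point residue off the diagonal; a tolerance-based check (my addition) confirms it is the identity for all practical purposes.
# off-diagonal entries are zero only up to floating-point error
numpy.allclose(numpy.matrix(eigenvectors.T)*numpy.matrix(eigenvectors), numpy.identity(size))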
pca.explained_variance_ratio_ for component \(i\) is \(\frac{\lambda_{i}}{\sum_{i}\lambda_{i}}\)¶
eigenvalues/eigenvalues.sum(), pca.explained_variance_ratio_, pca.explained_variance_/pca.explained_variance_.sum()
principal components are uncorrelated¶
numpy.corrcoef(transformedrvs.T)
verify that variance of transformed data equals the eigenvalues of the covariance matrix¶
transformedrvs.std(axis = 0, ddof = 0)**2, eigenvalues, pca.explained_variance_
visually verify that transformed data is pairwise uncorrelated¶
fig,ax = plt.subplots(nrows = 3, ncols = 2, figsize = (6,9), gridspec_kw = {'height_ratios':[1]*3,'width_ratios':[1]*2})
fig.subplots_adjust(wspace = 0.4, hspace = 0.4)
for i,(a,b) in enumerate(itertools.combinations(range(size),2)):
    row = i//2
    col = i%2
    ax[row][col].scatter(transformedrvs[:,a],transformedrvs[:,b],marker = '.')
    ax[row][col].set_xlabel('dim {a}'.format(a = a))
    ax[row][col].set_ylabel('dim {b}'.format(b = b))
Notice in the image of the six plots above that the range of dimension 0 is the greatest, followed by dimension 1, then dimension 2, and finally dimension 3. This should make sense, since dimension 0 is the first principal component and thus explains the greatest variance in the data; each subsequent principal component explains a smaller share of the variation. We can see from the explained_variance_ratio_ that the first two principal components explain roughly 96% of the variation in the data, so from a dimension reduction standpoint we could keep just the first two principal components.
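To actually carry out that reduction, a second PCA with n_components = 2 keeps only those two directions. The sketch below (names like pca2 and reduced are mine, not from the session) also maps the reduced data back to the original space and reports the fraction of variance given up.
# keep only the first two principal components
pca2 = sklearn.decomposition.PCA(n_components = 2)
reduced = pca2.fit_transform(rvs)                 # shape (1000, 2)
reconstructed = pca2.inverse_transform(reduced)   # back to shape (1000, 4)
# fraction of total variance lost by dropping the last two components
1 - pca2.explained_variance_ratio_.sum()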
I hope this article helped you see how eigenvectors and eigenvalues fit into the principal component analysis problem.