Principal Components
Principal components analysis models the variance structure of a set of observed variables using linear combinations of the variables. These linear combinations, or components, may be used in subsequent analysis, and the combination coefficients, or loadings, may be used in interpreting the components. While we generally require as many components as variables to reproduce the original variance structure, we usually hope to account for most of the original variability using a relatively small number of components.
We may, for example, have a very large number of variables describing individual health status that we wish to reduce to a manageable set. By forming linear combinations of the observed variables we may achieve data reduction by creating a handful of measures that describe overall health (e.g., “strength,” “fitness,” “disabilities”). The coefficients in these linear combinations may be used to provide interpretation to the newly constructed health measures.
The principal components of a set of variables are obtained by computing the eigenvalue decomposition of the observed variance matrix. The first principal component is the unit-length linear combination of the original variables with maximum variance. Subsequent principal components maximize variance among unit-length linear combinations that are orthogonal to the previous components.
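The computation described above can be sketched outside EViews as follows. This is a minimal Python/NumPy illustration, not EViews code: the function name `pca_eig` and the synthetic data `X` are assumptions introduced for this example only.

```python
import numpy as np

def pca_eig(X, use_correlation=True):
    """Return eigenvalues (descending) and unit-length eigenvectors (columns)."""
    X = np.asarray(X, dtype=float)
    # Dispersion matrix: correlation or covariance of the columns of X.
    if use_correlation:
        S = np.corrcoef(X, rowvar=False)
    else:
        S = np.cov(X, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): five series driven by one common factor plus noise.
common = rng.standard_normal((100, 1))
X = common + 0.8 * rng.standard_normal((100, 5))

eigvals, loadings = pca_eig(X)
# Share of total variance attributed to each component.
proportion = eigvals / eigvals.sum()
```

Each column of `loadings` has unit length and is orthogonal to the others, and for a correlation matrix the eigenvalues sum to the number of variables.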
For additional details, see Johnson and Wichern (1992).
Performing Principal Components
EViews allows you to compute the principal components of the estimated correlation or covariance matrix of a group of series, and to display your results in a variety of ways. You may display the table of eigenvalues and eigenvectors, display line graphs of the ordered eigenvalues, and examine scatterplots of the loadings and component scores. Furthermore you may save the component scores and corresponding loadings to the workfile.
As an illustration, we again consider the stock price example from Johnson and Wichern (1992) in which 100 observations on weekly rates of return for Allied Chemical, DuPont, Union Carbide, Exxon, and Texaco were examined over the period from January 1975 to December 1976 (“Stocks.WF1”).
To perform principal components on these data, we open the group G1 containing the series and select View/Principal Components... to open the dialog.
The principal components dialog has two tabs. Here, we have selected the first tab, labeled Components. The second tab, labeled Calculation, controls the computation of the dispersion matrix from the series in the group. By default, EViews will perform principal components on the ordinary (Pearson) correlation matrix, but you may use the settings on this tab to modify the preliminary calculation. We will examine this tab in greater detail in “Covariance Calculation”.
Viewing the Components
The Components tab is used to specify options for displaying the components or saving the eigenvalues and eigenvectors of the variances. The Display box allows you to choose between showing the eigenvalues and eigenvectors in a tabular form, or displaying line graphs of the ordered eigenvalues, or scatterplots of the loadings, scores, or both (biplot). As you select different display methods, the remainder of the dialog will change to provide you with different settings.
Table
In the figure above, the Table display setting is chosen. There are two sets of fields that you may wish to modify. First, EViews provides you with three settings for controlling the number of components to be displayed; the number displayed will be the minimum number satisfying any of the criteria. The Maximum number setting should be self-explanatory. The Minimum eigenvalue instructs EViews to only show results for components where the eigenvalue (variance) exceeds a threshold. The Cumulative proportion target tells EViews to retain the first components such that the sum of their proportion of the variances meets or exceeds the target proportion of the total variance. By default, the settings are chosen so that all components will be retained.
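One reading of the three retention settings is sketched below in Python. The function name `components_to_show`, its argument names, and the eigenvalues in the example are hypothetical; the rule implemented (take the smallest count implied by any of the criteria) follows the description above and is not taken from the EViews source.

```python
def components_to_show(eigenvalues, max_number=None, min_eigenvalue=0.0,
                       cum_proportion=1.0):
    """Count components kept under the three criteria, taking the smallest count."""
    eig = sorted(eigenvalues, reverse=True)
    total = sum(eig)
    # Criterion 1: maximum number of components.
    n_max = len(eig) if max_number is None else min(max_number, len(eig))
    # Criterion 2: components whose eigenvalue exceeds the threshold.
    n_min_eig = sum(1 for v in eig if v > min_eigenvalue)
    # Criterion 3: fewest leading components whose cumulative share of the
    # total variance meets or exceeds the target proportion.
    n_cum, running = len(eig), 0.0
    for k, v in enumerate(eig, start=1):
        running += v
        if running / total >= cum_proportion - 1e-12:
            n_cum = k
            break
    return min(n_max, n_min_eig, n_cum)

# Hypothetical eigenvalues of a 5-variable correlation matrix (they sum to 5).
example = [2.5, 1.2, 0.7, 0.4, 0.2]
keep_default = components_to_show(example)
keep_kaiser = components_to_show(example, min_eigenvalue=1.0)
keep_target = components_to_show(example, cum_proportion=0.8)
```

With the defaults shown here, every component is retained, which matches the default behavior described above.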
The Output fields allow you to save the eigenvalues and eigenvectors to the workfile. Simply enter a valid name in the corresponding field if you wish EViews to save your results.
If we leave the default settings as is and click OK, EViews will display a table of results. Here we show the top two sections of the table. The header describes the sample of observations, the method used to compute the dispersion matrix, and information about the number of components retained (in this case, all five).
The first section summarizes the eigenvalues, showing the values, the forward difference in the eigenvalues, the proportion of total variance explained, etc. Since we are performing principal components on a correlation matrix, the sum of the scaled variances for the five variables is equal to 5. The first principal component accounts for 57% of the total variance (2.856/5.00 = 0.5713), while the second accounts for 16% (0.809/5.00 = 0.1618) of the total. The first two components account for over 73% of the total variation.
The second section describes the linear combination coefficients. We see that the first principal component (labeled “PC1”) is a roughly equal linear combination of all five of the stock returns; it might reasonably be interpreted as a general stock return index. The second principal component (labeled “PC2”) has negative loadings for the three chemical firms (Allied, DuPont, and Union Carbide), and positive loadings for the oil firms (Exxon and Texaco). This loading appears to represent an industry-specific component.
The third section of the output displays the calculated correlation matrix.
Eigenvalues Plots
You may elect to display line graphs of the ordered eigenvalues by selecting Eigenvalues plots in the Display portion of the main dialog. The dialog will change to offer you the choice of displaying plots of any of: the eigenvalues (scree plot), the eigenvalues difference, the cumulative proportion of variance explained. By default, EViews will only display the scree plot of ordered eigenvalues.
For the stock data, displaying the scree and cumulative proportion graphs yields the graph depicted here. The scree plot in the upper portion of the view shows the sharp decline between the first and second eigenvalues. Also depicted in the graph is a horizontal line marking the mean value of the eigenvalues (which is always 1 for eigenvalue analysis conducted on correlation matrices).
The lower portion of the graph shows the cumulative proportion of the total variance. As we saw in the table, the first two components account for about 73% of the total variation. The diagonal reference line offers an alternative method of evaluating the size of the eigenvalues. The slope of the reference line may be compared with the slope of the cumulative proportion; segments of the latter that are steeper than the reference line have eigenvalues that exceed the mean.
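The relationship between the reference line and the mean eigenvalue can be checked with a short Python/NumPy sketch. The function `scree_summary` and the eigenvalues used are hypothetical (they are not the stock example values); the sketch assumes the diagonal reference line rises by 1/p per component, where p is the number of variables.

```python
import numpy as np

def scree_summary(eigenvalues):
    """Quantities behind the scree, difference, and cumulative proportion plots."""
    eig = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    proportion = eig / eig.sum()
    cumulative = np.cumsum(proportion)
    # Difference between each eigenvalue and the next one.
    difference = eig[:-1] - eig[1:]
    # Assumed slope of the diagonal reference line: 1/p per component.
    reference_slope = 1.0 / eig.size
    # A segment is steeper than the reference line when its increment exceeds 1/p.
    steeper = proportion > reference_slope
    return eig, difference, cumulative, steeper

# Hypothetical eigenvalues of a 5-variable correlation matrix (mean is 1).
eig, difference, cumulative, steeper = scree_summary([2.5, 1.2, 0.7, 0.4, 0.2])
above_mean = eig > eig.mean()
```

In this example `steeper` and `above_mean` flag the same components, which is the comparison the reference line is meant to support.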
Other Graphs (Variable Loadings, Component Scores, Biplots)
We continue our example by displaying the biplot graph since it includes the options for both the loadings and scores plots. If we select the Biplots (scores and loadings) entry, the right side of the dialog changes to provide additional plot options.
Components to Plot
The top right portion of the dialog, labeled Components to plot, is where you will provide the basic specification for the graphs that you want to display.
First, you must provide a list of components to plot. Here, the default setting “1 2” instructs EViews to place the first component on the x-axis and the second component on the y-axis. You may reverse the order of the axes by reversing the indices.
You may add indices for additional components. When more than two indices are provided, the Multiple graphs setting provides choices for how you wish to process the indices. You may elect to plot the first listed component against the remaining components (First vs. All), to use successive pairs of indices to form plots (XY pairs), or to plot each component against the others (Lower triangular matrix).
In the latter three cases, you will be prompted to indicate whether you wish to adjust the results to account for the sample size (Adjust scores & loadings for sample size). By default, EViews uses this setting and scales the loadings and scores so that the variances of the scores (instead of the norms) have the desired structure (see “Observation Scaling”). Setting this option may improve the interpretability of the plot. For example, when normalizing scores, the sample size adjustment scales the results so that the Euclidean distances between observations are the Mahalanobis distances and the cosines of the angles between variables are the correlations.
Using the default settings and clicking on OK, EViews produces the view.
The component scores are displayed as circles and the variable loadings are displayed as lines from the origin with variable labels. The biplot clearly shows us that the first component has positive loadings for all five variables (the general stock return index interpretation). The second component has positive variable loadings for the energy stocks, and negative loadings for the chemical stocks; when the energy stocks do well relative to the chemical stocks, the second specific component will be positive, and vice versa.
The scores labels show us that observation 3 is an outlier, with a high value for the general stock market index, and a relatively neutral value for the sector index. Observation 37 shows a poor return for the general market but is relatively sector neutral. In contrast, observation 20 is a period in which the overall market return was positive, with high returns to the energy sector relative to the chemical sector.
Graph Options
There are three additional options provided under Graph options. The first option is to Center graphs around zero. Unchecking this box will generally enlarge the graph within the frame at the expense of making it somewhat more difficult to quickly discern the signs of scores and loadings in a given dimension.
The Obs. labels dropdown allows you to choose the style of text labeling for observations. By default, EViews will Label outliers, but you may instead choose to Label all obs. or to display Symbols only. If you choose to label outliers, EViews will use a cutoff based on the specified probability value for the Mahalanobis distance of the observation from 0. The default is 0.1, so that labeled observations differ from 0 with probability less than 0.1.
The last option, Loadings axis scaling, is available only for biplot graphs. Note that the observations and variables in a biplot will generally have very different data scales. Instead of displaying biplots with dual scales, EViews applies a constant scaling factor to the loadings axes to make the graph easier to read. Loadings axis scaling allows you to override the EViews default scale for the loadings in two distinct ways.
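A probability-based cutoff of this kind can be sketched in Python as follows. The text does not state which reference distribution EViews uses, so the chi-square distribution with 2 degrees of freedom (for two plotted components) is an assumption of this sketch, as are the function name `outlier_flags` and the synthetic scores.

```python
import math
import numpy as np

def outlier_flags(scores, prob=0.1):
    """Flag rows whose squared Mahalanobis distance from 0 is improbably large."""
    scores = np.asarray(scores, dtype=float)
    if scores.shape[1] != 2:
        raise ValueError("the closed-form cutoff below assumes 2 plotted components")
    # Second-moment matrix about the origin, since distance is measured from 0.
    S = scores.T @ scores / scores.shape[0]
    d2 = np.einsum("ij,jk,ik->i", scores, np.linalg.inv(S), scores)
    # Assumption: chi-square with 2 degrees of freedom, where P(d2 > c) = exp(-c / 2).
    cutoff = -2.0 * math.log(prob)
    return d2 > cutoff, d2, cutoff

rng = np.random.default_rng(3)
scores = rng.standard_normal((200, 2))   # hypothetical component scores
scores[0] = [6.0, 0.0]                   # plant one extreme observation
flags, d2, cutoff = outlier_flags(scores)
```

Under these assumptions a probability of 0.1 corresponds to a squared-distance cutoff of about 4.6, and the planted observation is among those flagged.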
First, you may instruct EViews to apply a scale factor to the automatically chosen factor. This method is useful if you would like to stretch or shrink the EViews default axes. With the Loadings axis scaling set to Automatic, simply enter your desired adjustment factor. The automatically determined loadings will be scaled by this factor.
Alternatively, if you wish to assign an absolute scaling factor, select User-specified for the axis scaling, and enter your scale factor. The original loadings will be scaled by this factor.
Saving Component Scores
EViews provides easy-to-use tools for saving the principal components scores and scaled loadings matrices in the workfile. Simply select Proc/Make Principal Components... from the main group menu to display the dialog. As with the main principal components view, the dialog has two tabs. The second tab controls the calculation of the dispersion matrix. The first describes the results that you wish to save.
For the latter three selections, you are also given the option of adjusting the scores and loadings for the sample size. If Adjust scores & loadings for sample size is selected, the scores are scaled so that their variance rather than the sums-of-squares (norms) match the desired value. In this example, the sample variances of the component scores will equal 1.
Next, you should enter names for the score series, one name per component score you wish to save. Here we enter two component names, “Market” and “Industry,” corresponding to the interpretation of our components given above. You may optionally save the loadings corresponding to the saved scores, eigenvalues, and eigenvectors to the workfile.
Covariance Calculation
The EViews routines for principal components allow you to compute the dispersion matrix for the series in a group in a number of ways. Simply click on the Calculation tab to display the preliminary calculation settings. The Type dropdown allows you to choose between computing a Correlation or a Covariance matrix.
The Method dropdown specifies computation of Ordinary, Ordinary (uncentered), Spearman rank-order, Kendall’s tau-a, or Kendall’s tau-b measures. The Type dropdown is not applicable if you select Kendall’s tau-a or Kendall’s tau-b as your method.
The remaining settings should be familiar from the covariance analysis view (“Covariance Analysis”). You may, for example, specify the sample of observations to be used and perform listwise exclusion of cases with missing values to balance the sample if necessary. Or you can perform partial and/or weighted analysis.
Note that component scores may not be computed for dispersion matrices estimated using Kendall’s tau-a and tau-b.
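To illustrate how the Type and Method choices change the matrix that is decomposed, the following Python/NumPy sketch computes an ordinary or Spearman rank-order dispersion matrix. The names `rank_columns` and `dispersion` are hypothetical, ties are assumed absent, Kendall's tau and the uncentered, partial, and weighted options are omitted, and `np.cov` uses an n-1 divisor that may differ from the EViews setting.

```python
import numpy as np

def rank_columns(X):
    """Replace each column by its ranks 1..n (assumes no tied values)."""
    X = np.asarray(X, dtype=float)
    return X.argsort(axis=0).argsort(axis=0) + 1.0

def dispersion(X, kind="correlation", method="ordinary"):
    """Correlation or covariance matrix of the columns of X."""
    X = np.asarray(X, dtype=float)
    if method == "spearman":
        # Spearman rank-order measures are ordinary measures computed on ranks.
        X = rank_columns(X)
    elif method != "ordinary":
        raise ValueError("only 'ordinary' and 'spearman' are sketched here")
    if kind == "correlation":
        return np.corrcoef(X, rowvar=False)
    return np.cov(X, rowvar=False)

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 3))         # hypothetical data
X[:, 1] += 0.7 * X[:, 0]
pearson = dispersion(X)
spearman = dispersion(X, method="spearman")
spearman_exp = dispersion(np.exp(X), method="spearman")
```

Because ranks are unchanged by monotone transformations, `spearman` and `spearman_exp` agree, while the ordinary correlation of `np.exp(X)` generally differs from `pearson`.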
Technical Discussion
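From the singular value decomposition, we may represent an $n \times p$ data matrix $Y$ of rank $r$ as:
$$Y = U D V' \tag{12.27}$$
where $U$ and $V$ are orthonormal matrices of the left and right singular vectors, and $D$ is a diagonal matrix containing the singular values.
More generally, we may write:
$$Y = A B' \tag{12.28}$$
where $A$ is an $n \times r$, and $B$ is a $p \times r$ matrix, both of rank $r$, and
$$A = n^{\beta/2}\, U D^{1-\alpha}, \qquad B = n^{-\beta/2}\, V D^{\alpha} \tag{12.29}$$
so that $0 \le \alpha \le 1$ is a factor which adjusts the relative weighting of the left (observations) and right (variables) singular vectors, and the terms involving $\beta$ are scaling factors where $\beta \in \{0, \alpha\}$. The basic options in computing the scores $A$ and the corresponding loadings $B$ involve the choice of (loading) weight parameter $\alpha$ and (observation) scaling parameter $\beta$.
In the principal components context, let $\Sigma$ be the cross-product moment (dispersion) matrix of $Y$, and perform the eigenvalue decomposition:
$$\Sigma = L \Lambda L' \tag{12.30}$$
where $L$ is the $p \times p$ matrix of eigenvectors and $\Lambda$ is the diagonal matrix with eigenvalues on the diagonal. The eigenvectors, which are given by the columns of $L$, are identified up to the choice of sign. Note that since the eigenvectors are by construction orthogonal, $L'L = LL' = I_p$.
We may set $U = Y L D^{-1}$, $V = L$, and $D = (n\Lambda)^{1/2}$, so that:
$$A = n^{\beta/2}\, Y L D^{-\alpha}, \qquad B = n^{-\beta/2}\, L D^{\alpha} \tag{12.31}$$
$A$ may be interpreted as the weighted principal components scores, and $B$ as the weighted principal components loadings. Then the scores and loadings have the following properties:
$$\begin{aligned} A'A &= n^{\beta} D^{-\alpha} L' Y'Y L D^{-\alpha} = n^{\beta} (n\Lambda)^{1-\alpha} \\ B'B &= n^{-\beta} D^{\alpha} L'L D^{\alpha} = n^{-\beta} (n\Lambda)^{\alpha} \\ BB' &= n^{-\beta} L D^{2\alpha} L' = n^{-\beta} L (n\Lambda)^{\alpha} L' \end{aligned} \tag{12.32}$$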
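The properties of the weighted scores and loadings can be verified numerically. The Python/NumPy sketch below assumes the scores are $A = n^{\beta/2} Y L D^{-\alpha}$ and the loadings are $B = n^{-\beta/2} L D^{\alpha}$ with $D = (n\Lambda)^{1/2}$, that the columns of $Y$ are centered, and that the dispersion matrix uses divisor $n$. The function name `scores_and_loadings` and the synthetic data are hypothetical.

```python
import numpy as np

def scores_and_loadings(Y, alpha=1.0, beta=0.0):
    """Weighted scores A and loadings B such that Y = A B' (Y assumed centered)."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    Sigma = Y.T @ Y / n                   # dispersion matrix with divisor n
    lam, L = np.linalg.eigh(Sigma)
    order = np.argsort(lam)[::-1]
    lam, L = lam[order], L[:, order]
    d = np.sqrt(n * lam)                  # diagonal of D = (n * Lambda)^(1/2)
    A = n ** (beta / 2) * (Y @ L) / d ** alpha    # Y L D^(-alpha), column by column
    B = n ** (-beta / 2) * L * d ** alpha         # L D^(alpha), column by column
    return A, B, lam

rng = np.random.default_rng(4)
Y = rng.standard_normal((50, 4)) @ np.diag([2.0, 1.5, 1.0, 0.5])
Y = Y - Y.mean(axis=0)                    # center the columns
n = Y.shape[0]

alpha, beta = 0.5, 0.0
A, B, lam = scores_and_loadings(Y, alpha, beta)
reconstruction_error = np.abs(A @ B.T - Y).max()
```

For any admissible `alpha` and `beta`, `A @ B.T` reproduces `Y` up to rounding error, while `A.T @ A` and `B.T @ B` are diagonal with entries that depend on the two parameters.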
Through appropriate choice of the weight parameter $\alpha$ and the scaling parameter $\beta$, you may construct scores and loadings with various properties (see “Loading Weights” and “Observation Scaling”). EViews provides you with the opportunity to choose appropriate values for these parameters when displaying graphs of principal component scores and loadings and when saving scores and loadings to the workfile.
Note that when computing scores using Equation (12.33), EViews will transform the data $Y$ to match the data used in the original computation. For example, the data will be scaled for analysis of correlation matrices, and partialing will remove means and any conditioning variables. Similarly, if the preliminary analysis involves Spearman rank-order correlations, the data are transformed to ranks prior to partialing. Scores may not be computed for dispersion matrices estimated using Kendall’s tau.
Loading Weights
At one extreme, we define the normalized loadings (also termed the form, or JK) decomposition where $\alpha = 0$. The scores formed from the normalized loadings decomposition will have variances equal to the corresponding eigenvalues. To see this, substituting into Equation (12.31), and using Equation (12.28) we have $Y = JK'$, where:
$$J \equiv A = n^{\beta/2}\, Y L, \qquad K \equiv B = n^{-\beta/2}\, L \tag{12.33}$$
From Equation (12.32), the scores and loadings have the norms:
$$J'J = n^{\beta} (n\Lambda), \qquad K'K = n^{-\beta} I_p \tag{12.34}$$
The rows of $J$ are said to be in principal coordinates, since the norm of $J$ is proportional to the diagonal matrix with the eigenvalues on the diagonal. The columns of $K$ are in standard coordinates since $K$ is orthonormal (Aitchison and Greenacre, 2002, p. 378). The JK specification has a row preserving metric (RPM) since the observations in $J$ retain their original scale.
At the other extreme, we define the normalized scores (also referred to as the covariance or GH) decomposition where $\alpha = 1$. Then we may write $Y = GH'$ where:
$$G \equiv A = n^{\beta/2}\, Y L D^{-1}, \qquad H \equiv B = n^{-\beta/2}\, L D \tag{12.35}$$
Evaluating the norms using Equation (12.32), we have:
$$G'G = n^{\beta} I_p, \qquad H'H = n^{-\beta} (n\Lambda) \tag{12.36}$$
For this factorization, $G$ is orthonormal (up to a scale factor) and the norm of $H$ is proportional to the diagonal matrix with $n$ times the eigenvalues on the diagonal. Thus, the specification is said to favor display of the variables since the loadings are in principal coordinates and the scores are in standard coordinates (so that their variances are identical). The GH specification is sometimes referred to as the column metric preserving (CMP) specification.
In interpreting results for the GH decomposition, bear in mind that the Euclidean distances between observations are proportional to Mahalanobis distances. Furthermore, the norms of the columns of $H$ are proportional to the factor covariances, and the cosines of the angles between the $H$ vectors approximate the correlations between variables.
There are infinitely many alternative scalings lying between the extremes. One popular alternative is to weight the scores and the loadings equally by setting $\alpha = 1/2$. This specification is the SQ or symmetric biplot, where $Y = AB'$ with:
$$A = n^{\beta/2}\, Y L D^{-1/2}, \qquad B = n^{-\beta/2}\, L D^{1/2} \tag{12.37}$$
Evaluating the norms of the scores $A$ and loadings $B$, we have:
$$A'A = n^{\beta} (n\Lambda)^{1/2}, \qquad B'B = n^{-\beta} (n\Lambda)^{1/2} \tag{12.38}$$
so that the norms of both the observations and the variables are proportional to the square roots of the eigenvalues.
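The three loading weight choices can be compared directly from a singular value decomposition. The Python/NumPy sketch below takes the scores as $U D^{1-\alpha}$ and the loadings as $V D^{\alpha}$ with no observation scaling; the helper `biplot_factors` and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical centered data with unequal column variances.
Y = rng.standard_normal((60, 3)) @ np.diag([3.0, 2.0, 1.0])
Y = Y - Y.mean(axis=0)
n = Y.shape[0]

# Thin SVD: Y = U diag(d) V'.
U, d, Vt = np.linalg.svd(Y, full_matrices=False)
eigenvalues = d ** 2 / n                  # eigenvalues of Y'Y / n

def biplot_factors(alpha):
    """Scores U D^(1-alpha) and loadings V D^alpha, without observation scaling."""
    A = U * d ** (1.0 - alpha)
    B = Vt.T * d ** alpha
    return A, B

norms = {}
for name, alpha in [("JK", 0.0), ("SQ", 0.5), ("GH", 1.0)]:
    A, B = biplot_factors(alpha)
    # Diagonals of A'A and B'B: squared column norms of scores and loadings.
    norms[name] = (np.diag(A.T @ A), np.diag(B.T @ B))
```

In this sketch the JK scores carry all of the scale (norms of `n * eigenvalues`), the GH loadings carry all of it, and the SQ choice splits it evenly so that both norms equal the singular values `d`.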
Observation Scaling
In the decompositions above, we allow for observation scaling of the scores and loadings parameterized by $\beta$. There are two obvious choices for the scaling parameter $\beta$.
First, we could ignore sample size by setting $\beta = 0$ so that:
$$A'A = (n\Lambda)^{1-\alpha}, \qquad B'B = (n\Lambda)^{\alpha} \tag{12.39}$$
With no observation adjustment, the norm of the scores equals $(n\Lambda)^{1-\alpha}$, the variance of the scores equals $n^{-\alpha}\Lambda^{1-\alpha}$, and the norm of the variables equals $n^{\alpha}$ times the eigenvalues raised to the $\alpha$ power. Note that the observed variance of the scores is not equal to, but is instead proportional to $\Lambda^{1-\alpha}$, and that the norm of the loadings is only proportional to $\Lambda^{\alpha}$.
Alternatively, we may set $\beta = \alpha$, yielding:
$$A'A = n\Lambda^{1-\alpha}, \qquad B'B = \Lambda^{\alpha} \tag{12.40}$$
With this sample size adjustment, the variance of the scores equals $\Lambda^{1-\alpha}$ and the norm of the variables equals $\Lambda^{\alpha}$.
Gabriel (1971), for example, recommends employing a principal components decomposition for biplots that sets $\beta = \alpha = 1$. From Equation (12.32) the relevant norms are given by:
$$G'G = n I_p, \qquad H'H = \Lambda \tag{12.41}$$
By performing observation scaling, the scores are normalized so that their variances (instead of their norms) are equal to 1. Furthermore, the Euclidean distances between points are equal to the Mahalanobis distances (using $\Sigma$), the norms of the columns of $H$ are equal to the eigenvalues, and the cosines of the angles between the $H$ vectors equal the correlations between variables. Without observation scaling, these results only hold up to a constant of proportionality.
By default, EViews performs observation scaling, setting $\beta = \alpha$. To remove this adjustment, simply uncheck the Adjust scores & loadings for sample size checkbox. Note that when EViews performs this adjustment, it employs the denominator from the original dispersion calculation, which will differ from $n$ if any degrees-of-freedom adjustment has been applied.
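The effect of observation scaling on the normalized scores decomposition can be checked with the following Python/NumPy sketch. It assumes both the weight and scaling parameters equal 1, centered data, and a dispersion matrix with divisor $n$ (no degrees-of-freedom adjustment); the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated data, centered by column.
Y = rng.standard_normal((80, 3)) @ np.array([[1.0, 0.5, 0.2],
                                             [0.0, 1.0, 0.4],
                                             [0.0, 0.0, 1.0]])
Y = Y - Y.mean(axis=0)
n = Y.shape[0]
Sigma = Y.T @ Y / n                       # dispersion matrix with divisor n
lam, L = np.linalg.eigh(Sigma)
lam, L = lam[::-1], L[:, ::-1]            # reorder to descending eigenvalues

alpha = beta = 1.0                        # normalized scores with observation scaling
d = np.sqrt(n * lam)
G = n ** (beta / 2) * (Y @ L) / d ** alpha    # scores
H = n ** (-beta / 2) * L * d ** alpha         # loadings

score_variances = (G ** 2).mean(axis=0)
# Squared Euclidean distance between observations 0 and 1 in score space.
euclid2 = np.sum((G[0] - G[1]) ** 2)
# Squared Mahalanobis distance between the same observations in data space.
diff = Y[0] - Y[1]
mahal2 = diff @ np.linalg.inv(Sigma) @ diff
# Cosines of the angles between the variable vectors (rows of H).
row_norms = np.linalg.norm(H, axis=1)
cosines = (H @ H.T) / np.outer(row_norms, row_norms)
```

Under these assumptions the score variances equal 1, `euclid2` matches `mahal2`, the squared column norms of `H` equal the eigenvalues, and `cosines` reproduces the correlation matrix of `Y`.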