Advanced Statistics - Biology 603
Bowling Green State University, Spring 2008
Latent variable (intervening variable): a theoretical variable hypothesized to influence a number of observed variables. Example: given body length (X), body width (Y), and body depth (Z), knowing any one of these lets you predict the others well. Hypothetical construct: what explains this relationship? Here a single intervening variable, body size, can account for it.
Principal Components Analysis (PCA) is one possible technique for extracting a set of factors. The goal is to interpret the correlations in your data set with a set of hypothetical constructs. PCA axes are formed as linear combinations of the original variables so as to maximize the variance accounted for by each axis. Axes are extracted in hierarchical fashion, each explaining the maximum amount of variance that remains at that step. After removing the variance accounted for by the previous principal component axis, the next component is chosen to account for the most variance in the remaining residuals, and so on. Thus, the amount of variance explained declines with each subsequent extraction of a factor. Principal Components Analysis is a mathematical manipulation which simply recasts a set of variables into a newly derived set of independent axes. The manipulation only reorients your coordinate system into one where data points may be expressed more efficiently. All data points can be translated between "Data Space" and "Factor Space" without loss of information.
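The round trip between "Data Space" and "Factor Space" can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the notes: the data are simulated (three measurements driven by one hypothetical latent "size"), and all variable names are made up for the example.

```python
import numpy as np

# Hypothetical data: three body measurements driven by one latent "size".
rng = np.random.default_rng(0)
size = rng.normal(size=200)
data = np.column_stack([
    size + 0.3 * rng.normal(size=200),   # body length (X)
    size + 0.3 * rng.normal(size=200),   # body width (Y)
    size + 0.3 * rng.normal(size=200),   # body depth (Z)
])

# Standardize each variable (mean 0, sample standard deviation 1).
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Eigen decomposition of the correlation matrix, largest variance first.
corr = np.corrcoef(z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Data space -> factor space (a rotation), then back again.
scores = z @ eigenvectors
recovered = scores @ eigenvectors.T

print(np.allclose(recovered, z))   # True when no information is lost
print(eigenvalues)                 # variance per axis, in declining order
```

Because all three axes are kept, the transformation is a pure rotation; information is lost only if some axes are dropped afterwards.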
Each item is defined by its value on a series of measured variables, where a priori each variable is viewed as being independent or orthogonal from the others. Factor analysis in a way rearranges the coordinate system that defines these items in order to more accurately reflect the situation where some variables may actually correlate with each other. Correlations exist because the variables to some degree measure the same thing. For instance if we measure body length, body width, and body depth of a series of individuals we won't be greatly surprised if we find that these correlate. The goal of factor analysis is to properly account for such correlations and to thereby identify those underlying factors which may explain them. New axes are constructed as linear combinations of the underlying variables. It is possible for us to extract the same number of axes as there are original variables.
The correlation matrix contains information about the degree to which information in different variables overlaps. As it represents a standardized variance/covariance matrix, each variable has equal representation in the overall picture. The total variance contained within a dataset of n orthogonal, standardized variables (each with s = 1) is always n*1 = n. Rank: the number of components contributing to the variance. A data matrix may be rank-deficient in the presence of perfect correlations among the original variables. This occurs when different variables are not linearly independent of each other (e.g., one variable is the straight sum of two other variables). In this case the correlation matrix is singular: no inverse can be formed, at least one eigenvalue is zero, and fewer than n axes carry any variance.
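A small numerical check of the total variance and of rank deficiency, as a sketch only. The three variables are simulated and the third is deliberately built as the exact sum of the other two; the explicit tolerance passed to matrix_rank is an assumption chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + b                                  # exact sum of the other two variables

corr = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)

# Each standardized variable contributes one unit of variance.
total_variance = np.trace(corr)

# Only two linearly independent variables, so only two components.
rank = np.linalg.matrix_rank(corr, tol=1e-10)

# Eigenvalues in ascending order; the smallest is zero up to rounding error.
eigenvalues = np.linalg.eigvalsh(corr)

print(total_variance, rank)
print(eigenvalues)
```

The eigenvalues still sum to n = 3, but the whole of that variance is carried by two axes.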
Variables are by definition considered orthogonal. Variances are automatically standardized when PCA is performed on the correlation matrix.
Principal Components Analysis (PCA) is one form of factor analysis. It allows us to extract linear combinations of the original variables. Principal Components Analysis uses Eigenvalue Decomposition, where two matrices are extracted from the n x n correlation matrix (R). Using linear algebraic techniques, such as Singular Value Decomposition, a set of component matrices is extracted to satisfy the following relationship:
R = U * S^2 * U'
Axes are specified by these two matrices, where the Eigenvector Matrix (U) contains information about the orientation of a series of extracted axes in our original data space, while the Eigenvalue Matrix (S^2) measures each axis' size.
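The decomposition can be verified numerically. The sketch below assumes simulated data and uses numpy.linalg.eigh on the correlation matrix; it also shows the Singular Value Decomposition route, applied to the standardized data matrix, which yields the same eigenvalues after dividing the squared singular values by N-1. Names such as R, U and S2 follow the symbols used in the text; everything else is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=150)
data = np.column_stack([latent + 0.5 * rng.normal(size=150) for _ in range(3)])

R = np.corrcoef(data, rowvar=False)        # n x n correlation matrix

# eigh is intended for symmetric matrices and returns ascending eigenvalues,
# so reorder to put the largest axis first.
eigenvalues, U = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]
S2 = np.diag(eigenvalues)                  # eigenvalue matrix

reconstructed = U @ S2 @ U.T
print(np.allclose(reconstructed, R))

# Alternative route: SVD of the standardized data matrix.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
data_singular_values = np.linalg.svd(z, compute_uv=False)
svd_eigenvalues = data_singular_values**2 / (len(z) - 1)
print(np.allclose(svd_eigenvalues, eigenvalues))
```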
Eigenvectors are orthonormal and describe each axis' direction. Eigenvectors are always at right angles to each other (orthogonal) and are normalized to unit length (i.e., the squares of the elements in each eigenvector always sum to 1). The orthogonal nature of eigenvectors can be confirmed by examining their crossproducts (i.e., always zero, as the cosine of 90° is zero). The transpose of the Eigenvector Matrix is also its inverse. Thus multiplying the matrix by its transpose will produce the identity matrix (I): U * U' = I. The eigenvector matrix contains the standardized regression coefficients (i.e., slopes) for a multiple linear regression equation in which all original variables are used to explain each principal axis.
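These properties are easy to confirm on a small example. The correlation matrix below is hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical correlation matrix for three measurements.
R = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.7],
              [0.6, 0.7, 1.0]])

_, U = np.linalg.eigh(R)                 # columns of U are the eigenvectors

unit_lengths = (U**2).sum(axis=0)        # squared elements of each eigenvector
cross_product = U[:, 0] @ U[:, 1]        # crossproduct of two different axes
identity_check = U @ U.T                 # eigenvector matrix times its transpose

print(unit_lengths)
print(cross_product)
print(np.round(identity_check, 10))
```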
The matrix of Eigenvalues describes the size of each eigenvector, contained within the diagonal of the matrix and measured in variance units. Each eigenvalue can thus also be expressed in relative form as % of total variance explained by each PC axis. The square roots of the eigenvalues, i.e., the singular values, describe the true magnitude of the eigenvectors in the actual units defined by the data space.
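As a brief sketch, the relative form and the singular values follow directly from the eigenvalues. The correlation matrix is again a hypothetical example.

```python
import numpy as np

# Hypothetical correlation matrix for three measurements.
R = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.7],
              [0.6, 0.7, 1.0]])

# eigvalsh returns ascending eigenvalues; reverse to put the largest first.
eigenvalues = np.linalg.eigvalsh(R)[::-1]

percent_explained = 100.0 * eigenvalues / eigenvalues.sum()
cumulative_percent = np.cumsum(percent_explained)
singular_values = np.sqrt(eigenvalues)   # in standard-deviation units

print(np.round(percent_explained, 1))
print(np.round(cumulative_percent, 1))
print(np.round(singular_values, 3))
```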
Factor Pattern Matrix (i.e., component loadings, factor loadings): expresses the eigenvectors at their true size by combining information on magnitude (the singular value) and direction (the unit-length eigenvector). Factor loadings are also the correlation coefficients (r) between the original variables and the factors.
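The equivalence between loadings and variable-factor correlations can be checked numerically. This sketch assumes simulated data; the loadings are built by scaling each eigenvector by the square root of its eigenvalue and are then compared with correlations computed directly from the component scores.

```python
import numpy as np

rng = np.random.default_rng(3)
latent = rng.normal(size=300)
data = np.column_stack(
    [latent + noise * rng.normal(size=300) for noise in (0.3, 0.5, 0.8)]
)
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

R = np.corrcoef(z, rowvar=False)
eigenvalues, U = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

# Direction (eigenvector column) times magnitude (singular value).
loadings = U * np.sqrt(eigenvalues)

# Correlate every original variable with every component score.
scores = z @ U
n_vars = z.shape[1]
full_corr = np.corrcoef(z, scores, rowvar=False)
variable_score_corr = full_corr[:n_vars, n_vars:]

print(np.allclose(variable_score_corr, loadings))
```

The loadings also reproduce the correlation matrix, since loadings times their transpose equals U * S^2 * U'.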
Squared Factor Pattern Matrix: expresses the proportion of variance in each variable explained by each factor (r^2). The sum of the squared factor loadings for each eigenvector adds up to its eigenvalue.
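A short check of the column sums, using a hypothetical correlation matrix:

```python
import numpy as np

# Hypothetical correlation matrix for three measurements.
R = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.7],
              [0.6, 0.7, 1.0]])

eigenvalues, U = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

loadings = U * np.sqrt(eigenvalues)
squared_loadings = loadings**2               # r^2: variable (row) by factor (column)

column_sums = squared_loadings.sum(axis=0)   # one sum per factor

print(np.round(squared_loadings, 3))
print(np.allclose(column_sums, eigenvalues))
```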
Factor Scores: recast the data points onto the new axes. Use the matrix of factor loadings to obtain scores (i.e., new coordinates) for each data point. When the scores are standardized, the sum of squared scores on each axis adds up to N-1, where N is the number of data elements. The sum of the squares of each column of the factor score coefficient matrix (the weights that turn standardized variables into standardized scores) is the inverse of the eigenvalue corresponding to that column.
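The two sums of squares can be illustrated as follows. This is a sketch under stated assumptions: the data are simulated, the scores are standardized, and the weight matrix (here called score_coefficients, a name invented for this example) is taken to be each eigenvector divided by the square root of its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs = 120
latent = rng.normal(size=n_obs)
data = np.column_stack(
    [latent + noise * rng.normal(size=n_obs) for noise in (0.4, 0.6, 0.9)]
)
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

R = np.corrcoef(z, rowvar=False)
eigenvalues, U = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

# Assumed form of the score coefficients: eigenvector / singular value.
score_coefficients = U / np.sqrt(eigenvalues)
scores = z @ score_coefficients              # standardized factor scores

sum_sq_scores = (scores**2).sum(axis=0)                    # N-1 per axis
sum_sq_coefficients = (score_coefficients**2).sum(axis=0)  # 1 / eigenvalue

print(np.round(sum_sq_scores, 6), n_obs - 1)
print(np.allclose(sum_sq_coefficients, 1.0 / eigenvalues))
```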
Factor analysis does not, however, tell us how many different factors (i.e., independent axes) may be extracted in a meaningful way. Biological reasoning should provide us with arguments about how many different underlying constructs should be present. Alternatively, based on measures of variance explained by each successive axis, we may use different criteria for whether a particular axis still contains meaningful information.
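As one example of a variance-based criterion, a widely used rule of thumb (often called the Kaiser criterion, and not necessarily the one intended in these notes) retains only axes whose eigenvalue exceeds 1, i.e., axes that explain more variance than a single standardized variable. The correlation matrix and the threshold below are assumptions for illustration.

```python
import numpy as np

# Hypothetical correlation matrix for three measurements.
R = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.7],
              [0.6, 0.7, 1.0]])

eigenvalues = np.linalg.eigvalsh(R)[::-1]    # largest first

# Rule of thumb (assumption): keep axes with eigenvalue > 1.
threshold = 1.0
retained = eigenvalues > threshold
n_retained = int(retained.sum())

print(np.round(eigenvalues, 3))
print(n_retained)
```

For these three strongly correlated measurements the rule keeps a single axis, consistent with one underlying construct such as body size.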
Communality: the proportion of a variable's variance that is explained by a factor structure. A communality is denoted by h^2. For each original variable, the communality measures the proportion of its variance that it shares with other variables, i.e., variance which can be represented by communal factors. It is the sum of the squared factor loadings for a given variable across all included factors, and equals 1.0 (100%) if no factors are dropped. The proportion of variance that is unique to each variable is the variable's total variance minus the communality.
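Communality and unique variance can be computed from the rows of the squared loadings. In this sketch the correlation matrix is hypothetical and the number of retained factors (n_factors = 1) is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical correlation matrix for three measurements.
R = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.7],
              [0.6, 0.7, 1.0]])

eigenvalues, U = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]
loadings = U * np.sqrt(eigenvalues)

# With every factor kept, each communality is 1.0 (100%).
full_communality = (loadings**2).sum(axis=1)

# Keep only the first factor (arbitrary choice for this example).
n_factors = 1
communality = (loadings[:, :n_factors]**2).sum(axis=1)   # h^2 per variable
uniqueness = 1.0 - communality    # standardized total variance is 1

print(np.round(full_communality, 6))
print(np.round(communality, 3))
print(np.round(uniqueness, 3))
```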