Simplify large data sets in kinesiology using factor analysis
This is an excerpt from Statistics in Kinesiology-4th Edition by William Vincent & Joseph Weir.
When data are collected on many variables using the same set of subjects, it is possible, indeed probable, that some of the variables are related. For example, if we want to determine the overall motor skill and fitness components of a group of people, we might collect data on the following variables: chin-ups, curl-ups in 1 minute, bench presses, leg presses, 50-yard dash, 440-yard dash, 1-mile run, sit-and-reach performance, skinfolds, softball throw for distance, softball throw for accuracy, standing long jump, vertical jump, hip flexibility with a goniometer, push-ups, step test for aerobic capacity, VO2max on a treadmill, body mass index, shuttle run, stork stand for balance, Bass stick test, reaction time, and perhaps many others.
You may immediately think that we don't need to use all of these tests because some of them test the same component of skill or fitness. What you really mean is that correlations exist between or among many of the measured variables. When two items are highly correlated, we can surmise that they essentially measure the same factor. For example, the sit-and-reach test and the hip flexibility with a goniometer test would be expected to be highly related to each other because the sit-and-reach test measures hip and low-back flexibility and the goniometer test measures hip flexibility. The common factor of hip flexibility produces a correlation between the variables. Likewise, bench presses, push-ups, and chin-ups are all measures of upper body strength but in slightly different ways. The 1-mile run and VO2max on a treadmill are also highly related in that they both measure factors related to aerobic capacity.
One could discover these relationships by producing an intercorrelation matrix among all the variables and looking for Pearson r values in excess of some predetermined level; for example, ±.80. This may work fairly well with a few variables, but when many variables are measured, the intercorrelation matrix grows rapidly and soon becomes too cumbersome to deal with.
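As a rough illustration of this screening step, the sketch below computes Pearson r for every pair of variables and keeps the pairs whose absolute value exceeds .80. The scores, the variable names, and the helper functions (`pearson_r`, `high_correlations`, `n_pairs`) are all hypothetical and exist only for this example; they are not taken from the book.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def high_correlations(data, threshold=0.80):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    pairs = []
    for name_a, name_b in combinations(data, 2):
        r = pearson_r(data[name_a], data[name_b])
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs


def n_pairs(k):
    """Number of distinct correlations in a matrix of k variables."""
    return k * (k - 1) // 2


# Hypothetical scores for six subjects (illustration only, not real data).
scores = {
    "sit_and_reach_cm": [20, 25, 30, 35, 40, 45],
    "goniometer_deg": [80, 88, 95, 100, 111, 118],
    "mile_run_min": [9.0, 7.5, 8.8, 7.2, 9.1, 7.4],
}

flagged = high_correlations(scores)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.2f}")

# The 22 tests listed earlier would require this many pairwise checks.
print(n_pairs(22))
```

With only three variables there are three correlations to inspect. The 22 tests named earlier would already produce 231, which is why the pairwise approach soon becomes unwieldy.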
To resolve this issue, a statistical technique called factor analysis has been developed. It identifies the common component among multiple dependent variables and labels that component a factor. The factor is derived (virtual); that is, it does not actually exist as a stand-alone measured value. It is simply the result of the common component between or among two or more actually measured variables. The factor has no name until the researcher gives it one. For example, the relationship among bench presses, push-ups, and chin-ups produces a factor that the researcher might label upper body strength.
Notice that we have not actually measured upper body strength; we have measured the number of bench presses, push-ups, and chin-ups a person can do. To determine the upper body strength of the subjects, we do not need to measure all three variables. Because the correlation among the variables is high (e.g., r > .80), if a person can do many bench presses, we can predict that he or she will also be able to do many chin-ups and push-ups. Hence, we need to measure only one of the variables to estimate the subject's upper body strength. Using this technique, researchers have identified the most common components of physical fitness: muscular strength, muscular endurance, aerobic capacity, body composition, and flexibility.
The same argument could be used for measures of speed. Perhaps we would find a correlation among the 50-meter dash, the 100-meter dash, and the shuttle run. The common factor here is sprinting speed. However, because each variable has unique characteristics (e.g., the shuttle run also requires the ability to stop and change directions quickly), the correlation among them is not perfect. But a common factor exists among them that we could label sprinting speed.
Factor analysis is therefore simply a method of reducing a large data set to its common components. It removes the redundancy among the measured variables and produces a set of derived variables called factors. We address these ideas in our study of multiple regression. To illustrate, examine figure 18.1. Variable Y shares common variance with X1, X2, X3, and X4 but very little with X5. Variables X3 and X5 are highly correlated (collinear) with each other, so they contain a common factor. Variable X5 is redundant and does not need to be measured because X3 can represent it in their common relationship with Y.
Computer programs that perform factor analysis list the original variables that group together into common derived factors. These factors are then given a number called an eigenvalue. The eigenvalue indicates the number of original variables that are associated with that factor (Kachigan, 1986, p. 246). If we are dealing with 10 original variables, then each one contributes an average of 10% of the total variance in the problem. If factor 1 had an eigenvalue of 4.26, then that factor would represent the same amount of variance as 4.26 of the original variables. Because each variable contributes an average of 10%, factor 1 would contribute 4.26 × 10 or 42.6% of the total variance.
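The arithmetic in this paragraph can be written as a one-line conversion. The function name `percent_of_variance` is hypothetical. The sketch assumes the variables are standardized, so that each one contributes exactly one unit of variance and the eigenvalues sum to the number of variables.

```python
def percent_of_variance(eigenvalue, n_variables):
    """Share of the total variance carried by one factor, in percent.

    Assumes standardized variables, so each original variable
    contributes 1 unit (100 / n_variables percent) of total variance.
    """
    return eigenvalue / n_variables * 100.0


# Values from the text: eigenvalue 4.26 with 10 original variables.
print(f"{percent_of_variance(4.26, 10):.1f}%")  # prints 42.6%
```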
The more variables that load on a given factor (i.e., the higher its eigenvalue), the more important that factor is in determining the total variance of the entire set of data. By comparing eigenvalues, we can determine the relative contribution of each factor. This method usually identifies the two or three most important factors in a data set. Because the goal of factor analysis is to reduce the number of variables to be measured by grouping them into factors, we are looking for the smallest number of factors that will adequately explain the data. Generally speaking, when the eigenvalue for a factor falls below the average variance contributed by one original variable (an eigenvalue of 1.0, which in our example corresponds to 10% of the total variance), that factor is not used. It is rare to identify more than three or four factors.
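A minimal sketch of this retention rule follows. It assumes the eigenvalues come from a correlation matrix, so the average contribution of one original variable is an eigenvalue of 1.0. Only the value 4.26 comes from the text. The other nine eigenvalues and the helper names (`retain_factors`, `cumulative_percent`) are invented for illustration.

```python
# Hypothetical eigenvalues for 10 standardized variables (they sum to 10).
# Only 4.26 appears in the text; the rest are invented for illustration.
eigenvalues = [4.26, 2.10, 1.15, 0.80, 0.60, 0.40, 0.30, 0.20, 0.12, 0.07]


def retain_factors(values, cutoff=1.0):
    """Keep factors whose eigenvalue is at least the cutoff."""
    return [ev for ev in values if ev >= cutoff]


def cumulative_percent(retained_values, n_variables):
    """Percent of total variance explained by the retained factors."""
    return sum(retained_values) / n_variables * 100.0


retained = retain_factors(eigenvalues)
explained = cumulative_percent(retained, len(eigenvalues))
print(f"{len(retained)} factors retained, explaining {explained:.1f}% of the variance")
```

With these made-up values, three factors pass the cutoff, and the remaining seven each explain less than a single original variable would.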
After the factors are identified, the computer can perform a procedure called factor rotation. This is a mathematical technique in which the factors are theoretically rotated in three-dimensional space in an attempt to maximize the correlation between the original variables and the factor on which they load. It also attempts to reduce the relationships among the factors. After all, if two factors are highly correlated with each other, we do not need both of them. Orthogonal rotation is the method in which the computer attempts to adjust or rotate the factors that have been selected as most representative of the data (i.e., the ones with the highest eigenvalues) so that the factors are orthogonal (the correlation among them is as close to zero as possible) and the correlation between each factor and the variables that load on it is maximized.
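The text does not name a specific rotation method. As one possible illustration, the sketch below applies varimax, a widely used orthogonal rotation, to the simplest case of two factors, where the best rotation angle has a closed form. The loadings are hypothetical, and the function names (`varimax_criterion`, `varimax_two_factors`) are made up for this example. Statistical packages handle more than two factors and usually normalize the loadings first; both are omitted here.

```python
from math import atan2, cos, sin


def varimax_criterion(loadings):
    """Raw varimax criterion: variance of squared loadings, summed over factors."""
    p = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squares = [row[j] ** 2 for row in loadings]
        mean_sq = sum(squares) / p
        total += sum((s - mean_sq) ** 2 for s in squares) / p
    return total


def varimax_two_factors(loadings):
    """Rotate two factors by the angle that maximizes the raw varimax criterion."""
    p = len(loadings)
    u = [x * x - y * y for x, y in loadings]
    v = [2.0 * x * y for x, y in loadings]
    a, b = sum(u), sum(v)
    c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    d = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
    angle = atan2(d - 2.0 * a * b / p, c - (a * a - b * b) / p) / 4.0
    cs, sn = cos(angle), sin(angle)
    # A rigid rotation keeps the two factors orthogonal to each other.
    rotated_loadings = [(x * cs + y * sn, -x * sn + y * cs) for x, y in loadings]
    return rotated_loadings, angle


# Hypothetical unrotated loadings (factor 1, factor 2) for six tests.
names = ["bench_press", "push_ups", "chin_ups", "dash_50m", "dash_100m", "shuttle_run"]
unrotated = [(0.70, 0.50), (0.65, 0.55), (0.68, 0.45),
             (0.60, -0.55), (0.62, -0.50), (0.55, -0.45)]

rotated, angle = varimax_two_factors(unrotated)
for name, (f1, f2) in zip(names, rotated):
    print(f"{name:12s} {f1:6.2f} {f2:6.2f}")
```

Before rotation, every test loads moderately on both factors. After rotation, the three strength tests load mainly on one factor and the three speed tests mainly on the other, which makes the factors easier to name. The sign of a factor is arbitrary, so negative loadings on the second factor do not change the interpretation.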
Assumptions and Cautions
Sample size is critical in factor analysis. Tabachnick and Fidell (1996, p. 579) suggest that problems may occur with too few cases. Sample size is less critical when the factors are independent and distinct. Variables should be reasonably normal; one should check them with a skewness and kurtosis test. Factor analysis assumes that the correlation among the variables is linear. If any correlation is curvilinear, the analysis will be less effective. Transformations of data can sometimes reduce curvilinearity. Multicollinearity or singularity is usually not a problem because we are actually trying to identify these characteristics of the data. If two variables are perfectly related (singular), both are not needed; one can be eliminated. Finally, outliers must be identified and eliminated or transformed. Outliers reduce the ability of the computer to find the appropriate factors.
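Some of these checks can be sketched in a few lines. The example below computes descriptive skewness and excess kurtosis coefficients and flags values more than three standard deviations from the mean. The data, the function names, and the z-score cutoff of 3.0 are assumptions made for illustration; the text does not prescribe a particular outlier rule, and a formal skewness or kurtosis test would also compare each coefficient with its standard error.

```python
from math import sqrt


def central_moments(values):
    """Return the mean and the 2nd, 3rd, and 4th central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return mean, m2, m3, m4


def skewness(values):
    """Moment coefficient of skewness (0 for a symmetric distribution)."""
    _, m2, m3, _ = central_moments(values)
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment coefficient of kurtosis minus 3 (0 for a normal distribution)."""
    _, m2, _, m4 = central_moments(values)
    return m4 / m2 ** 2 - 3.0


def flag_outliers(values, z_cutoff=3.0):
    """Indices of values whose |z| exceeds the cutoff (an assumed rule)."""
    mean, m2, _, _ = central_moments(values)
    sd = sqrt(m2)
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > z_cutoff]


# Hypothetical push-up counts for 20 subjects; the last value is extreme.
push_ups = [9, 10, 11] * 6 + [10, 40]

print(f"skewness = {skewness(push_ups):.2f}")
print(f"excess kurtosis = {excess_kurtosis(push_ups):.2f}")
print("outlier indices:", flag_outliers(push_ups))
```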