## Analysis of associations between variables: Multiple Linear Regression Analysis

Multiple Linear Regression Analysis generalizes Linear Regression Analysis to include more than one independent variable. One obtains the best-fitting linear, multi-variable equation to account for the existing data and, hopefully, make correct predictions on new data. As such, Multiple Linear Regression Analysis represents a multivariable strategy rather than a bona fide multivariate technique, a term usually reserved for analyses with multiple dependent variables.

### Uses

• develop an equation that summarizes the relationship between a dependent (i.e., criterion) variable and a set of independent (i.e., predictor) variables
• identify a subset of independent variables most useful for predicting a dependent variable
• predict values for a dependent variable based on a set of independent variables
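These uses can be sketched with R's built-in `lm()` and `predict()` functions; the data and variable names below are simulated for illustration, not taken from any particular dataset:

```r
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - 0.5 * d$x2 + rnorm(50)

# Develop the equation: fit the dependent variable on both predictors
fit <- lm(y ~ x1 + x2, data = d)
coef(fit)   # intercept and partial regression coefficients

# Predict values of the dependent variable for new observations
predict(fit, newdata = data.frame(x1 = c(0, 1), x2 = c(0, 0)))
```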

### Assumptions

• Linearity
• Normality
• Homoscedasticity (i.e., the same level of dispersion throughout the range of the independent variable)
• Regression confirms relationships, not causal mechanisms (e.g., the size of fire damage correlates with the number of fire fighters deployed, but the fire fighters do not cause the damage)
• For stable estimates, have at least 10-20 times as many observations as variables
• Matrix ill-conditioning: ill-conditioned matrices produce estimated coefficients that are unstable (i.e., small changes within the range of measurement error of the variables can lead to disproportionately large changes in the estimates).
• Multi-collinearity is a type of ill-conditioning that occurs when predictor variables are highly correlated among themselves. In this case the importance of a given predictor is difficult to assess, because the excessive correlations confound the main predictor effects. It is often desirable to perform MLR on centered variables (i.e., with the variable's mean subtracted from each of its values) to reduce this problem.
• Many analyses depend on the variance-covariance matrix being of full rank. Problems occur when one independent variable is simply a linear function of another variable. This happens when both measure the same thing, and there is then obviously little sense in asking which one serves as the better predictor of Y. Mathematically, such a situation produces a (correlation or VCV) matrix that is referred to as singular or of reduced column rank. Among many other problems, no inverse matrix can be calculated in such a case.
• Single Outliers can severely bias regression coefficients
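The multi-collinearity and centering points above can be sketched in R with simulated data: a raw predictor and its square are typically almost perfectly correlated, and centering the predictor first markedly improves the conditioning of the model matrix (base R's `kappa()` estimates the condition number):

```r
set.seed(2)
x <- rnorm(100, mean = 50, sd = 5)

# Raw x and x^2 are almost perfectly correlated -> ill-conditioned matrix
cor(x, x^2)

# Centering x (subtracting its mean) removes most of that correlation
xc <- x - mean(x)
cor(xc, xc^2)

# The condition number of the model matrix drops sharply after centering
kappa(model.matrix(~ x + I(x^2)))
kappa(model.matrix(~ xc + I(xc^2)))
```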

### How this is done

• towards this goal, first obtain a variance/covariance or correlation matrix for all variables
• forward, backward or step-wise selection of terms to include/retain
• forward selection: first test the fit of a simple model with a minimal number of terms, then check whether more complex models significantly improve the fit; at each step, enter the next most significant term and test whether it results in a significant improvement
• backward selection: first enter all terms then remove the one term that contributes the least, then test whether the fit decreases significantly
• step-wise selection: alternate forward and backward selection
• B coefficients (i.e., partial regression coefficients): list the independent contribution of each independent variable to the prediction of the dependent variable. Note: these are not measures of the importance of each variable, as each depends on its correlations with the other variables, and its magnitude depends on the units in which the variable is measured
• β weights (i.e., standardized regression coefficients): obtained as the regression coefficients after all variables have been standardized (Z-scored).
• Assess the relative importance of independent variables by examining to what degree the coefficient of determination (R2) increases when a variable is entered into the equation. Recall that R2 measures how much of the variance in the dependent variable is explained by the independent-variable terms. An R2 = 0.6 means that 40% of the variance is residual (unexplained) variability.
• Partial correlation coefficients: the square root of R2 gives the non-directional correlation between the independent variable and the dependent variable when the linear effects of X have been removed from the association between Y and Z. The signs of the B coefficients indicate the direction of the association.
• Alternatively, partial correlation coefficients can be obtained by performing linear regressions of Y on X and of Z on X. The residuals from these analyses represent X-free variation, and a regression analysis is then performed between them.
• Consider the importance of high correlations between the various Xs
• Examine the residuals to check the assumptions (linearity, normality, homoscedasticity)
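The residual-based route to partial correlations, and the computation of standardized β weights, can be sketched in R on simulated data (variable names x, z, y are illustrative):

```r
set.seed(3)
x <- rnorm(200)
z <- 0.5 * x + rnorm(200)
y <- 0.4 * x + 0.3 * z + rnorm(200)

# Remove the linear effect of x from both y and z
ry <- resid(lm(y ~ x))
rz <- resid(lm(z ~ x))

# The correlation of these residuals is the partial correlation of y and z given x
cor(ry, rz)

# Standardized (beta) weights: refit after Z-scoring every variable
coef(lm(scale(y) ~ scale(x) + scale(z)))
```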
To perform a stepwise multiple linear regression analysis, download the datafile "Bodymeasures.txt", then recode the variable Sex[M,F] into Sex_num[0,1] and make sure the variable is treated as numeric. You can then create a linear regression model with the variables you want, and display it:

```r
Dataset <- read.table("/BodyMeasures.txt", header=TRUE, sep=",",
                      na.strings="NA", dec=".", strip.white=TRUE)
Dataset$Sex_num[Dataset$Sex=="M"] <- 0
Dataset$Sex_num[Dataset$Sex=="F"] <- 1
Dataset$Sex_num <- as.numeric(Dataset$Sex_num)
fit <- lm(Sex_num ~ Mass + Fore + Head, data=Dataset)
fit
```

Load library MASS (which is likely already pre-installed on your system), perform stepwise model selection by exact AIC, and display the results:

```r
library(MASS)
step <- stepAIC(fit, direction="both")
step$anova
```