## Analysis of associations between variables: Linear Regression Analysis

Linear Regression Analysis allows us to evaluate whether, and to what degree, a dependent variable (Y) is explained by an independent variable (X). Toward this goal we examine data from multiple cases for which both measures are available. Specifically, we test whether knowing the value of X for a given case provides a reasonable estimate of its corresponding Y. We first empirically obtain the line that best fits our data set, and we then test whether our data associate significantly (i.e., non-randomly) with this line.

#### Uses

• characterize the relationship between a dependent and an independent variable to determine the extent, direction, and strength of the association
• seek a quantitative formula that predicts the value of a dependent variable as a function of the independent variable
• control for a variable that is suspected of impacting another relationship
• determine which of several independent variables is best suited for describing and predicting values of a dependent variable
• residuals: one person's error is somebody else's treasure (the variation the model leaves unexplained can itself be worth studying)
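The first two uses above can be sketched in R; the variable names (`size`, `mass`) and data values here are made up for illustration:

```r
# Hypothetical data: six (size, mass) pairs, one per individual
size <- c(10, 12, 15, 18, 20, 24)
mass <- c(5.1, 6.0, 7.4, 9.2, 10.1, 12.0)

# Fit the linear model: mass = a + b * size
fit <- lm(mass ~ size)
coef(fit)                      # the quantitative formula: intercept a and slope b

# Predict the dependent variable for new values of the independent variable
predict(fit, newdata = data.frame(size = c(14, 22)))

# Residuals: the part of each observed mass the fitted line does not explain
residuals(fit)
```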

#### Interpretation

• Caveat: causality is the assumption, not the conclusion — regression quantifies association; it cannot by itself establish that X causes Y

#### How this is done

• independent variable (X): the predictor variable; it may be a specified, controlled predictor (set by the experimenter) or an unspecified, observational predictor
• dependent variable (Y): the response variable
• we obtain n pairs of measures for an X and a Y variable
• independence: each pair is obtained from a different individual
• forward method: first test the fit of a simple model based on a straight line, then check whether more complex models significantly improve the fit
• Linear Models: y-intercept and slope; mathematical vs. statistical model. The model is called linear not because the fitted curve must be a straight line, but because the population parameters enter the model additively.
• Least Squares Method for finding the best-fitting straight line: for each candidate line you can compute a predicted value Ŷ for any given x and then sum the squared differences, Σ(Y − Ŷ)². The line with the smallest sum of squares is the one with the best fit.
• Fortunately the constants a and b can be obtained easily using the worksheet for Regression Analysis. Understand why the ratio of the sum of cross products (x, y) to the sum of squares (x) represents the slope of the best-fitting line: b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
• Because Y is a random variable, we cannot obtain an exact Y value for a specific X. The error term e describes how far individual responses fall from the population regression line: yᵢ = a + bxᵢ + eᵢ
• Obtain confidence intervals for the slope to see whether a slope of zero (a horizontal line) is included
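The steps above can be sketched in R with made-up data: the slope from the sum of cross products over the sum of squares matches the least-squares fit from `lm()`, and `confint()` shows whether a horizontal line (slope 0) falls inside the confidence interval:

```r
# Hypothetical data: n = 6 (x, y) pairs, one per individual
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)

# Slope b = sum of cross products (x, y) / sum of squares (x)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)   # the least-squares line passes through (mean(x), mean(y))

# The same constants from R's built-in least-squares fit
fit <- lm(y ~ x)
coef(fit)                    # intercept and slope agree with a and b

# 95% confidence interval for the slope: does it include 0 (a horizontal line)?
confint(fit, "x", level = 0.95)
```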

#### Additional Terms derived from the ANOVA Tables

Considering an ANOVA table for a Regression analysis, understand and develop an intuitive feeling for the derivation and meaning of all terms listed below:

• Coefficient of Determination (r²): the proportion of the total variation in one variable that is explained by the other variable: r² = SSRegression / SSTotal; multiply by 100 to express it as a percentage
• Adjusted Coefficient of Determination (adjusted r²): adjusted r² = 1 − (MSError / MSTotal)
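Both quantities can be derived directly from the ANOVA table of an `lm()` fit in R (data values made up for illustration), and they agree with what `summary()` reports:

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
fit <- lm(y ~ x)

tab <- anova(fit)                 # ANOVA table: rows "x" and "Residuals"
SS_reg <- tab["x", "Sum Sq"]
SS_err <- tab["Residuals", "Sum Sq"]
SS_tot <- SS_reg + SS_err

r2 <- SS_reg / SS_tot             # coefficient of determination
MS_err <- SS_err / tab["Residuals", "Df"]
MS_tot <- SS_tot / (length(y) - 1)
r2_adj <- 1 - MS_err / MS_tot     # adjusted coefficient of determination

c(r2 = r2, r2_adj = r2_adj)
```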

Worksheet: Regression

In R you would first import the data file "DummyData.txt", then create the model for the linear regression, then report the ANOVA table and results:

```r
dummy <- read.table("http://caspar.bgsu.edu/~courses/Stats/Labs/Datasets/DummyData.txt", header=TRUE)
dummy.lm <- lm(dummy$Brightness ~ dummy$Size)
summary(dummy.lm)
anova(dummy.lm)
```