Analysis of associations between variables: Linear Regression Analysis
Linear Regression Analysis allows us to evaluate whether and to what degree a dependent variable (Y) is explained by an independent variable (X). Towards this goal we examine the data from multiple cases for which both measures are available. Specifically we test whether knowing a value of X for a given case will provide us with a reasonable estimate for its corresponding Y. We first empirically obtain the line that best fits our data set, and we then test whether our data associate significantly (i.e. non-randomly) with this line.
Uses
characterize the relationship between a dependent and an independent variable to determine the extent, direction, and strength of the association
seek a quantitative formula that predicts the value in a dependent variable as a function of the independent variable
control for a variable that is suspected of impacting another relationship
determine which of several independent variables is best suited for describing and predicting values of a dependent variable
residuals: one person's error is somebody else's treasure
Interpretation
Caveat: causality is the assumption, not the conclusion
How this is done
independent variable (X, the predictor variable): may be a specified, controlled predictor or an unspecified, observational predictor
dependent variable (Y, the response variable)
we obtain n pairs of measures for an X and a Y variable
independence: each pair is obtained from a different individual
forward method: first test the fit of a simple model based on a straight line, then check whether more complex models significantly improve the fit
Linear Models: y-intercept and slope, mathematical vs. statistical model. The model is called linear not because the common model is a straight line, but because the population parameters enter it additively.
Least Squares Method for finding the best-fitting straight line: For any candidate line you can compute a predicted y value (Ŷ) for each given x and then sum the squared differences over all cases, Σ(Y − Ŷ)². The line with the smallest sum of squares is the one with the best fit.
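The least squares criterion can be sketched in a few lines of Python. The data values and the three candidate lines below are made up purely for illustration, and the function name sum_squared_residuals is a hypothetical helper, not part of any library.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of (Y - Y_hat)^2 for the candidate line Y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustration data: n = 5 pairs of (x, y) measures.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Candidate lines as (intercept a, slope b); the last one is the
# least-squares line for these particular data.
candidates = {
    "flat": (4.0, 0.0),
    "steep": (1.0, 1.0),
    "least_squares": (2.2, 0.6),
}

sse = {name: sum_squared_residuals(xs, ys, a, b)
       for name, (a, b) in candidates.items()}
best = min(sse, key=sse.get)  # the line with the smallest sum of squares
print(sse, best)
```

Any other line tried on these data gives a larger sum of squares than the least-squares line.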
Fortunately, the constants a (y-intercept) and b (slope) can be obtained directly using the following worksheet for Regression Analysis. Understand why the ratio of the sum of cross products (x, y) to the sum of squares (x) represents the slope of the best-fitting line.
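As a rough sketch of what such a worksheet computes, the slope and intercept can be derived from the sum of cross products and the sum of squares of x. The function name fit_line and the data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y_hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x: squared deviations of x from its mean.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    # Sum of cross products: deviations of x times deviations of y.
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp_xy / ss_x          # slope
    a = mean_y - b * mean_x   # intercept: the line passes through the means
    return a, b

# Made-up illustration data.
a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(a, b)
```

The ratio appears because setting the derivative of the sum of squared differences with respect to b to zero yields exactly this quotient.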
As Y is a random variable, we cannot get an exact Y value for a specific X. The error term (e) describes how far each individual response lies from the population regression line: yᵢ = a + bxᵢ + eᵢ
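In a sample, the errors are estimated by the residuals, i.e. the observed y minus the value predicted by the fitted line. The sketch below assumes made-up data and uses the least-squares constants for those data; the helper name residuals is hypothetical.

```python
def residuals(xs, ys, a, b):
    """Observed y minus predicted y for the line Y_hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Made-up illustration data and their least-squares constants.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

errors = residuals(xs, ys, a, b)
print(errors)
print(sum(errors))  # residuals from a least-squares line sum to ~0
```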
Obtain confidence intervals for the slope to see whether a slope of zero (i.e. a horizontal line) is included; if it is, the data do not show a significant linear association.
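A minimal sketch of the slope interval, assuming the usual standard error of the slope, sqrt(MS residual / sum of squares of x), with n - 2 degrees of freedom. The critical t value is passed in by hand (3.182 is the two-sided 95% value for 3 degrees of freedom from a standard t table); the data and the function name are made up for illustration.

```python
import math

def slope_confidence_interval(xs, ys, t_crit):
    """Return (lower, upper, se_b) for the least-squares slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp_xy / ss_x
    a = mean_y - b * mean_x
    # Residual mean square: unexplained variation per degree of freedom.
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ms_resid = ss_resid / (n - 2)
    se_b = math.sqrt(ms_resid / ss_x)  # standard error of the slope
    return b - t_crit * se_b, b + t_crit * se_b, se_b

# Made-up illustration data; n = 5, so df = 3.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
lower, upper, se_b = slope_confidence_interval(xs, ys, t_crit=3.182)
includes_zero = lower <= 0.0 <= upper
print(lower, upper, includes_zero)
```

For these five points the interval spans zero, so a horizontal line cannot be ruled out at the 95% level.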
Considering an ANOVA table for a Regression analysis, understand and develop an intuitive feeling for the derivation and meaning of all terms listed below:
Coefficient of Determination (r²) is the proportion of the total variation in one variable that is explained by the other variable: r² = SSRegression / SSTotal; multiplied by 100, it gives the percentage of variation explained.
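The terms of a regression ANOVA table can be worked through on a small example. This is only a sketch on made-up data; the function name regression_anova and the dictionary keys are hypothetical labels chosen for illustration.

```python
def regression_anova(xs, ys):
    """Return the ANOVA terms for a simple linear regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp_xy / ss_x
    a = mean_y - b * mean_x
    predicted = [a + b * x for x in xs]
    # Partition the total variation in y into explained and unexplained parts.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_regression = sum((p - mean_y) ** 2 for p in predicted)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    df_regression, df_residual = 1, n - 2
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "SS_regression": ss_regression, "df_regression": df_regression,
        "SS_residual": ss_residual, "df_residual": df_residual,
        "SS_total": ss_total, "df_total": n - 1,
        "F": ms_regression / ms_residual,
        "r2": ss_regression / ss_total,
    }

# Made-up illustration data.
table = regression_anova([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for term, value in table.items():
    print(term, value)
```

Here SSRegression and SSResidual add up to SSTotal, and r² times 100 reads as the percentage of the variation in y explained by x.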