## Analysis of Variance (ANOVA)

#### Uses

• characterize the relationship between a dependent (continuous) and an independent (nominal or ordinal) variable to determine the existence and strength of the association
• characterize differences between sample means to test whether they could have been drawn from the same underlying distribution

#### Parametric methods for the comparison of sample means

• ANOVA: F = s2(between) / s2(within)

#### How this is done

Consider the situation where you wish to compare a series of k samples, each containing n values. You want to evaluate statistically whether these samples could all have been derived from the same underlying distribution or whether this scenario is unlikely. You specifically test H0: µ1 = µ2 = µ3 ... = µk. To test this null hypothesis that the k population means are equal, we compare two different estimates of variance: one based on the variation of individual data points around their individual sample means [s2(within)], and the other based on an estimate of variance among the sample means [s2(between)]. The logic behind this is that s2(within) is always an estimate of the true population variance (assuming that the samples have equal variance). In contrast, s2(between) is only an estimate of the true population variance if H0 is correct. If we then calculate the ratio s2(between)/s2(within), this value should be close to one under the null hypothesis. We can reject the null hypothesis if this ratio is particularly high, indicating that the variance estimate derived from the sample means is disproportionately large compared to that derived from the individual data points around their sample means.
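The F-ratio described above can be computed directly. A minimal sketch with NumPy, using made-up illustrative data (k = 3 samples of n = 4 values each):

```python
import numpy as np

# Illustrative data: k = 3 samples of n = 4 values each (made up for this sketch)
samples = [np.array([4.0, 5.0, 6.0, 5.0]),
           np.array([6.0, 7.0, 8.0, 7.0]),
           np.array([9.0, 10.0, 11.0, 10.0])]
k = len(samples)
n = len(samples[0])

# s2(within): pooled variance of the data points around their own sample means
s2_within = np.mean([s.var(ddof=1) for s in samples])

# s2(between): n times the variance of the sample means around the grand mean
sample_means = np.array([s.mean() for s in samples])
s2_between = n * sample_means.var(ddof=1)

F = s2_between / s2_within  # large F -> reject H0 of equal means
```

Here the sample means (5, 7, 10) are spread much more widely than the within-sample scatter, so F comes out far above one, which is exactly the signature of unequal population means.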

Step-by-step: Note that when you collect small data sets from the same underlying normal distribution, the means of these samples will all vary slightly due to chance differences in the actual values sampled. Variances from different samples will likewise differ from each other due to chance alone. Just as data points are normally distributed around their sample mean with variance s2 = Σ(Yi − Ȳ)2 / (n − 1), the sample means will be normally distributed around a mean of means with standard error SE = s/√n.

Synonyms: Note that SSbetween is also referred to as SSmodel or SSregression. SSwithin is the same as SSresidual or SSerror.

#### Assumptions

ANOVA is a parametric technique; it requires homoscedasticity and normality (or large N):

• Independence of data points
• Normality, or <Central Limit Theorem> the distribution of sample means approaches a normal probability distribution as sample size increases, regardless of the shape of the population from which items are sampled. A sample size of 30 is often regarded as sufficient to employ the central limit theorem
• Homoscedasticity: homogeneity of variances
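These assumptions can be screened before running an ANOVA. A sketch using two standard tests from SciPy (the simulated groups are placeholders for your own data):

```python
import numpy as np
from scipy import stats

# Placeholder data: three groups drawn from the same normal distribution
rng = np.random.default_rng(1)
groups = [rng.normal(5.0, 1.0, 30) for _ in range(3)]

# Homoscedasticity: Levene's test (H0: all group variances are equal)
lev_stat, lev_p = stats.levene(*groups)

# Normality within each group: Shapiro-Wilk test (H0: data are normal)
shapiro_ps = [stats.shapiro(g).pvalue for g in groups]
```

A small p-value in either test flags a violated assumption; with n = 30 per group, the central limit theorem usually makes the ANOVA robust to modest non-normality anyway.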

#### Additional Terms derived from the ANOVA Tables

Considering an ANOVA table, understand and develop an intuitive feeling for the derivation and meaning of all terms listed below:

• Coefficient of Determination (r2) is obtained as the proportion of variance explained by the model: SSM / SST. The value ranges between 0 and 1 (i.e., 0-100%)
• Adjusted Coefficient of Determination (r2adj.) is often more comparable across models with different numbers of parameters: 1 - (MSE/MST)
• Correlation Coefficient (r) is a non-directional measure of the association: SQRT(SSM / SST)
• Standard Deviation of the Residuals or Root Mean Square Error (s) estimates the standard deviation of the random error. It is used in power analysis and post-hoc tests: SQRT(MSE)
• Raw Effect Size (d) is estimated from population values: SQRT(SSM/N)
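The terms above all fall out of the sums-of-squares decomposition SST = SSM + SSE. A sketch computing them by hand (the three groups are made-up illustrative data):

```python
import numpy as np

# Made-up illustrative data: k = 3 groups of n = 4 values each
groups = [np.array([4.0, 5.0, 6.0, 5.0]),
          np.array([6.0, 7.0, 8.0, 7.0]),
          np.array([9.0, 10.0, 11.0, 10.0])]
all_y = np.concatenate(groups)
N, k = all_y.size, len(groups)
grand_mean = all_y.mean()

# Sums of squares: total, model (between), and error (within)
ss_total = ((all_y - grand_mean) ** 2).sum()
ss_model = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_error = ss_total - ss_model

r2 = ss_model / ss_total          # coefficient of determination
mse = ss_error / (N - k)          # MS(within / error)
mst = ss_total / (N - 1)          # MS(total)
r2_adj = 1 - mse / mst            # adjusted coefficient of determination
rmse = np.sqrt(mse)               # standard deviation of the residuals
```

Note that r2_adj is slightly smaller than r2, since the adjustment penalizes the model for the degrees of freedom it consumes.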

Worksheet: ANOVA