Advanced Statistics - Biology 603
Bowling Green State University, Spring 2008
Review of Basics in Statistics
The Normal Distribution
A family of distributions with the same general shape (i.e., Gaussian distributions, bell curves): they feature a single peak at the precise center of the distribution, are symmetric, are concentrated in the middle, and decrease the further you go into the tails. A binomial distribution approaches a normal distribution at very large sample sizes. Normal distributions are centered on the population mean (µ) and dispersed around it with a given population variance (σ²). A Standard Normal Distribution is one that has the following parameters (µ=0, σ²=1, g1=0, g2=0). Any general normal distribution can be converted to a Standard Normal Distribution (µ=0, σ²=1) using the Standard Normal Deviate, z = (x - µ)/σ, or z-Score.
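The conversion to a z-Score can be sketched in a few lines of Python. This is a minimal illustration, not part of the original course material: the function name `z_score` and the example values (a measurement of 130 from a population with mean 100 and standard deviation 15) are assumptions chosen for the example. `statistics.NormalDist` is from the Python standard library (3.8 or later).

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standard normal deviate: distance of x from the mean in units of sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical example values, chosen only for illustration
z = z_score(130, 100, 15)

# Area under the standard normal curve to the left of z,
# i.e., the kind of value a z-table lists
p_below = NormalDist().cdf(z)

print(f"z = {z:.2f}, area below z = {p_below:.4f}")
```

A z-Score of 0 corresponds to the mean itself; positive values lie above the mean, negative values below it.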
- Measures of Central Tendency
- Mean (arithmetic, 1st moment): sum of values divided by the number of values
- Median: midpoint of values after they have been arranged from highest to lowest (i.e., 50th percentile)
- Mode: midpoint of class interval with largest frequency (i.e., most common value and thus the highest point in a distribution). Suitable for all types of data, including nominal, but examine the data for bi- or multimodality
- Measures of Dispersion
- Sum of squares
- Mean squares (Variance)
- Standard Deviation (square root of the variance, which is the 2nd central moment)
- Range and Interquartile Range (i.e., the difference between the 75th and 25th percentiles; an underutilized, stable measure of dispersion)
- Measures of Asymmetry
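- Skewness (g1, 3rd moment): In a symmetric distribution (skewness = 0), exactly one half of all measures lies above the mean and the other half below. Positive skewness: the longer tail extends towards high values; negative skewness: the longer tail extends towards low values. Be concerned about skewness and kurtosis values >1 or <-1.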
- Measures of Peakedness
- Kurtosis (g2, 4th moment): leptokurtic - high-peaked; mesokurtic - normal; platykurtic - flat-topped
- z-Tables list the area under the probability density function for a standard normal distribution
- Central limit theorem: explains why many distributions tend towards normality when the random variable being observed is the sum or mean of many independent identically distributed random variables.
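The measures listed above can be computed with the Python standard library alone. The sketch below is an illustration under stated assumptions: the helper name `describe` and the data values are made up for the example, skewness and kurtosis use the simple moment-based formulas (statistical packages often apply small-sample corrections and will give slightly different values), and the mode is taken as the most common raw value rather than the midpoint of a class interval.

```python
import statistics as st

def describe(values):
    """Basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = st.fmean(values)
    # Sum of squares: squared deviations from the mean, summed
    ss = sum((x - mean) ** 2 for x in values)
    # Mean square (sample variance): sum of squares over n - 1
    variance = ss / (n - 1)
    q1, _, q3 = st.quantiles(values, n=4)  # 25th, 50th, 75th percentiles
    # Moment-based skewness (g1) and excess kurtosis (g2), no bias correction
    m2 = ss / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": st.median(values),
        "modes": st.multimode(values),  # more than one entry signals multimodality
        "sum_of_squares": ss,
        "variance": variance,
        "std_dev": variance ** 0.5,
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "g1": m3 / m2 ** 1.5,
        "g2": m4 / m2 ** 2 - 3,
    }

# Hypothetical data with one large value in the right tail
data = [2, 3, 3, 4, 5, 5, 5, 6, 9]
stats = describe(data)
for name, value in stats.items():
    print(name, value)
```

With these made-up data the single large value pulls the mean above nothing else but the tail, so g1 comes out positive (right skew).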
Sampling
- <Population>: complete set of individuals having some common characteristic (e.g., all Australians). In many cases it will be impossible or at least unfeasible to obtain a true population <Parameter> through a measurement obtained for every member of the population. In this case one may estimate the particular characteristic of interest for the underlying population from a sample of items drawn from the population
- The sample should be representative of the population, but it is unreasonable to expect that the population characteristic will be matched exactly by the sample characteristic (statistic). <Statistic>: measurable characteristic of a sample (e.g., s²), used to estimate a <Parameter>: measurable characteristic of a population (e.g., σ²). For example, a sample mean will differ from the underlying population mean due to chance alone. Sampling error: the difference between an obtained sample statistic and its true corresponding population parameter.
- Sampling frame: subset of the population from which the sample is actually drawn (e.g., White pages)
- Sample: the set of people included in the study (i.e., selected from the sampling frame) (e.g., every 50th person in the white pages)
- <Probability sampling>: Each member of the underlying population has a known likelihood of being included in the sample. <Non-probability sampling>: selection is arbitrary, so the likelihood of inclusion is unknown and the sample may not be representative of the population. A sample is biased if it is not representative of the overall population
- <Random sampling>: each member of the population has an equal chance of being selected
- <Systematic random sampling>: Members of the population are linearly arranged in some fashion and a starting point is selected at random. After this, every kth element is chosen for the sample
- <Stratified random sampling>: A population is divided into logical subgroups and random samples are drawn from each subgroup (e.g., random sampling within each state)
- <Cluster sampling>: identify ‘clusters’ of individuals & sample from these, or <Multi-stage cluster sampling>: e.g., 1 person per selected household per selected suburb
- <Quota sampling>: (e.g., 50% psychology students, 30% economics students, 20% law students)
- <Convenience sampling>: “take them where you find them” method e.g., at shopping mall
- <Snowball sampling>: ask each respondent if they know someone else suitable for survey (e.g., studying drug-users)
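Three of the probability sampling schemes above (random, systematic random, and stratified random sampling) can be sketched with Python's standard `random` module. Everything named here is hypothetical and chosen for illustration only: the function names, the sampling frame of 1000 numeric IDs, the two strata labelled "OH" and "MI", and the seed.

```python
import random

def simple_random_sample(frame, n, rng):
    """Random sampling: every member has an equal chance of selection."""
    return rng.sample(frame, n)

def systematic_sample(frame, k, rng):
    """Systematic random sampling: random start, then every kth element."""
    start = rng.randrange(k)
    return frame[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified random sampling: a random sample within each subgroup."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(603)        # fixed seed so the example is repeatable
frame = list(range(1, 1001))    # hypothetical sampling frame of 1000 IDs

srs = simple_random_sample(frame, 20, rng)
systematic = systematic_sample(frame, 50, rng)   # e.g., every 50th entry
strata = {"OH": frame[:600], "MI": frame[600:]}  # hypothetical subgroups
stratified = stratified_sample(strata, 10, rng)

print(len(srs), len(systematic), {k: len(v) for k, v in stratified.items()})
```

Drawing the same number from each stratum, as done here, is only one option; sample sizes proportional to stratum size are another common choice.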
Normal Distribution
- If multiple samples are obtained from a population, their means will be approximately normally distributed around the true underlying population mean (µ), as described by the Central limit theorem. Moreover, this is true regardless of the shape of the population from which items are sampled. The distribution of sample means approaches a normal probability distribution when sample size is sufficiently large (a common rule of thumb is N >= 30).
- Standard error of the mean: Standard deviation of the means of multiple samples drawn from a particular population (σ / SQRT(N) for samples of size N)
- Confidence intervals: Range within which the population parameter is expected to fall for a given level of confidence
- individual measures (e.g., 95% µ ± 1.96σ; 99% µ ± 2.58σ)
- sample means (individual confidence intervals / SQRT(N); e.g., N=100: 95% µ ± 0.196σ; 99% µ ± 0.258σ)
- Additional graphics
- Estimated Probability Density Function
- Quantile Box Plot
- Normal Quantile Plot
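The behaviour of sample means, the standard error, and the confidence interval widths quoted above can be checked with a small simulation. This is a sketch under assumptions that are not in the original notes: the population is taken to be exponential with mean 1 and standard deviation 1 (strongly right-skewed), 5000 samples of size N = 30 are drawn, and the names `sample_means` and `ci_half_width` are made up for the example.

```python
import random
import statistics as st
from statistics import NormalDist

rng = random.Random(2008)  # fixed seed so the example is repeatable

def sample_means(draw, n, repeats):
    """Means of `repeats` independent samples, each of size n."""
    return [st.fmean(draw() for _ in range(n)) for _ in range(repeats)]

def draw():
    # Assumed population: exponential with mean 1 and standard deviation 1
    return rng.expovariate(1.0)

n = 30
means = sample_means(draw, n, 5000)

# Standard error of the mean: observed spread of the sample means
# compared with the theoretical value sigma / sqrt(N)
observed_se = st.stdev(means)
expected_se = 1.0 / n ** 0.5

def ci_half_width(sigma, n, confidence=0.95):
    """Half-width of a confidence interval for a sample mean, known sigma."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sigma / n ** 0.5

print(f"mean of sample means: {st.fmean(means):.3f}")
print(f"observed SE: {observed_se:.3f}, expected SE: {expected_se:.3f}")
print(f"N=100, 95%: ± {ci_half_width(1.0, 100, 0.95):.3f} sigma")
print(f"N=100, 99%: ± {ci_half_width(1.0, 100, 0.99):.3f} sigma")
```

Even though the assumed population is far from normal, the sample means cluster around the population mean with a spread close to σ / SQRT(N); a histogram or normal quantile plot of `means` would show how close to normal their distribution is.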
last modified: 01/11/05
This material is copyrighted and MAY NOT be used for commercial purposes, © 2001-2008 lobsterman.