Advanced Statistics - Biology 6030

Bowling Green State University, Fall 2017

Proximity Measures
Proximities are measures in which similarity is expressed in terms of a metric distance function d(i, j). Different distance functions are available for interval-scaled, boolean, categorical, ordinal and ratio variables. If a dataset contains measures on multiple characteristics for multiple individuals, then a matrix of proximities can be obtained either across individuals (comparing them over the measures) or across measures (comparing them over the individuals).
Examine the resulting similarity matrix, e.g., look for the nearest (minimum-distance) neighbour, run a cluster analysis, etc.
Uses
 measure how similar (or different) objects are across a set of characteristics
Examples
 How similar are different skulls based on morphological characteristics?
 How similar are environmental conditions in different habitats based on the co-occurrence of species?
How this is done
 Construct a proximity matrix between a set of items: the proximity between two objects is a number indicating how similar (i.e., a similarity matrix) or how different (i.e., a dissimilarity or distance matrix) the two objects are. Geographic distances between locations on a map measure dissimilarity: a larger value of the measure indicates a greater distance.
 Continuous Data: examine covariation in the variables describing the cases. First z-transform all data so that every column has a mean of zero and a standard deviation of 1.
 Correlation Matrix
 Euclidean Distance
 Squared Euclidean Distance: D^{2} = (x1_{1} - x1_{2})^{2} + (x2_{1} - x2_{2})^{2} + ... + (xn_{1} - xn_{2})^{2}
 Pattern Similarity Matrix: Σ(x_{i} * y_{i}) / SQRT(Σx_{i}^{2} * Σy_{i}^{2})
 Manhattan Distance Matrix
 Mahalanobis Distance Matrix: Distances that take into account the correlations in your data
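The continuous-data measures listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names and the `skulls` data are hypothetical, the z-transform assumes the sample standard deviation (n - 1 in the denominator), and the pattern similarity is read as the cosine of the angle between two profiles. The Mahalanobis distance is left out because it needs the inverse covariance matrix.

```python
import math
from statistics import mean, stdev

def z_transform(rows):
    """Standardize each column to mean 0 and sample standard deviation 1.
    Assumes no column is constant (a zero standard deviation would divide by zero)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def squared_euclidean(x, y):
    """D^2: sum of squared differences across variables."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(x, y))

def manhattan(x, y):
    """Sum of absolute differences across variables."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def pattern_similarity(x, y):
    """Sum of cross products divided by the root of the product of sums of squares."""
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    denominator = math.sqrt(sum(xi ** 2 for xi in x) * sum(yi ** 2 for yi in y))
    return numerator / denominator

def proximity_matrix(rows, measure):
    """Apply a pairwise measure to every pair of cases (rows)."""
    return [[measure(r1, r2) for r2 in rows] for r1 in rows]

# Hypothetical data: three skulls (rows) measured on three traits (columns).
skulls = [[10.0, 4.0, 7.0], [12.0, 5.0, 6.0], [20.0, 9.0, 2.0]]
z = z_transform(skulls)
distances = proximity_matrix(z, euclidean)
```

Passing `manhattan` or `squared_euclidean` to `proximity_matrix` instead gives the corresponding distance matrix for the same standardized data.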
 Ordinal Data: convert to ranks, scale to [0 .. 1], then treat as continuous
 Ratio Data: apply a logarithmic transformation y = log(x), convert to ranks, map to [0 .. 1], then treat as continuous
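The rank-and-rescale recipe for ordinal and ratio data might look like the sketch below. The function names are hypothetical, and two choices are assumptions not stated in the notes: tied values receive the mean of their ranks, and ranks are mapped to [0 .. 1] by subtracting the smallest rank and dividing by the range of ranks.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def scale_unit(ranks):
    """Map ranks linearly onto [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(ranks), max(ranks)
    if hi == lo:
        return [0.0 for _ in ranks]
    return [(r - lo) / (hi - lo) for r in ranks]

def ordinal_to_unit(values):
    """Ordinal data: ranks, then [0, 1]."""
    return scale_unit(average_ranks(values))

def ratio_to_unit(values):
    """Ratio data: log transform (values must be positive), then ranks, then [0, 1]."""
    return ordinal_to_unit([math.log(v) for v in values])
```

The rescaled columns can then be passed to any of the continuous-data distance measures.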
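 Binary Data: examine joint presence or absence. Both outcomes may be of equal value for calculating a distance (e.g., male vs. female) or they may be of different value (e.g., presence or absence of a disease). In the first case you may want to treat both kinds of match the same; in the second you may place more value on matched presence than on matched absence. In the formulas below, a is the number of characteristics present in both objects, b the number present only in the first, c the number present only in the second, and d the number absent from both.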
 Simple Matching Matrix: equal weight for matches and non-matches with joint absences included. SimpleMatch = (a+d) / (a+b+c+d)
 Russell & Rao Matrix: equal weight for matches and non-matches with joint absences included in the denominator only. RussellRao = a / (a+b+c+d)
 Sokal & Sneath 1: double weight for matches with joint absences included. SS1 = 2(a+d) / (2(a+d)+b+c)
 Rogers & Tanimoto: double weight for non-matches with joint absences included. Rogers = (a+d) / (a+d+2(b+c))
 Jaccard's Similarity Matrix (i.e., similarity ratio): equal weight for matches and non-matches with joint absences excluded. Jaccard = a / (a+b+c)
 Dice Matrix: double weight for matches with joint absences excluded. Dice = 2a / (2a+b+c)
 Fourfold Point Correlation Matrix: this is the binary form of the product moment correlation coefficient. Phi = (a*d - b*c) / SQRT((a+b)(a+c)(b+d)(c+d))
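The binary coefficients above all derive from the same four counts, so they can be computed together. The sketch below assumes the usual convention that a counts joint presences, b and c count the two kinds of mismatch, and d counts joint absences; the function names and the example vectors are hypothetical.

```python
import math

def contingency(x, y):
    """Count joint presences (a), mismatches (b, c) and joint absences (d)."""
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)
    d = sum(1 for xi, yi in zip(x, y) if not xi and not yi)
    return a, b, c, d

def binary_similarities(x, y):
    """Return the coefficients listed in the notes for two 0/1 vectors.
    No guard against zero denominators (e.g., two all-absent vectors)."""
    a, b, c, d = contingency(x, y)
    n = a + b + c + d
    return {
        "simple_matching": (a + d) / n,
        "russell_rao": a / n,
        "sokal_sneath_1": 2 * (a + d) / (2 * (a + d) + b + c),
        "rogers_tanimoto": (a + d) / (a + d + 2 * (b + c)),
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "phi": (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d)),
    }

# Hypothetical presence/absence of six species in two habitats.
habitat_1 = [1, 1, 0, 0, 1, 0]
habitat_2 = [1, 0, 0, 1, 1, 0]
coefficients = binary_similarities(habitat_1, habitat_2)
```

For these two vectors a = 2, b = 1, c = 1 and d = 2, so the measures that ignore joint absences (Jaccard, Dice) and those that count them (simple matching, Sokal & Sneath 1) can be compared directly.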
 Various Types of Data, e.g. strings, gene sequences, orders of events
 Hamming Distance: this measures the minimum number of substitutions required to change one item into another of the same length (e.g., point mutations in a gene sequence)
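For two sequences of equal length, the Hamming distance is simply the number of positions at which they differ. A minimal sketch follows; the function name and the example sequences are hypothetical.

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(1 for a, b in zip(s, t) if a != b)

# Two hypothetical gene fragments that differ at two positions.
changes = hamming_distance("GATTACA", "GACTATA")
```

The same function works on any pair of equal-length sequences, such as strings or lists of ordered events.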
last modified: 3/26/14
This material is copyrighted and MAY NOT be used for commercial purposes, © 2001-2017 lobsterman.