Advanced Statistics - Biology 6030
Bowling Green State University, Fall 2019
Proximity Measures
Proximities are measures that express how similar or how different two objects i and j are, usually in terms of a distance function d(i, j). Different distance functions are available for interval-scaled, boolean, categorical, ordinal and ratio variables. If a dataset contains measures on multiple characteristics for multiple individuals, then a matrix of proximities can be obtained across individuals (comparing them over the measures) or across measures (comparing them over the individuals).
Once the proximity matrix has been computed, it can be examined directly (e.g., to find each item's nearest neighbour) or used as input to cluster analysis and related methods.
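The two directions in which a proximity matrix can be built may be easier to see in a small sketch. The data values and the helper names (`euclidean`, `distance_matrix`, `nearest_neighbour`) are made up for illustration, and Euclidean distance is used only as one possible choice of d(i, j).

```python
import math

# Hypothetical data: 3 individuals (rows) measured on 2 characteristics (columns).
data = [
    [1.0, 2.0],
    [2.0, 4.0],
    [5.0, 1.0],
]

def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(items):
    """Symmetric matrix of pairwise distances between the given items."""
    return [[euclidean(p, q) for q in items] for p in items]

def nearest_neighbour(matrix, i):
    """Index of the item closest to item i, excluding i itself."""
    others = [j for j in range(len(matrix)) if j != i]
    return min(others, key=lambda j: matrix[i][j])

# Proximities across individuals: compare the rows.
between_individuals = distance_matrix(data)

# Proximities across measures: transpose, then compare the columns.
columns = [list(col) for col in zip(*data)]
between_measures = distance_matrix(columns)

print(nearest_neighbour(between_individuals, 0))  # 1
```

The first matrix is 3 x 3 (one row per individual), the second is 2 x 2 (one row per characteristic).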
Uses
- measure how similar (or different) objects are across a set of characteristics
Examples
- How similar are different skulls based on morphological characteristics?
- How similar are environmental conditions in different habitats based on the co-occurrence of species?
How this is done
- Construct a proximity matrix for a set of items: the proximity between two objects is a number indicating how similar (i.e., similarity matrix) or how different (i.e., dissimilarity or distance matrix) the two objects are. For example, geographic distances between locations on a map measure dissimilarity: a larger value indicates that two locations are farther apart.
- Continuous Data: examine co-variation in the variables describing the cases. z-transform all data so that every column has a mean of zero and a standard deviation of 1.
- Correlation Matrix
- Euclidean Distance
- Squared Euclidean Distance: $D^2 = (x_{11} - x_{12})^2 + (x_{21} - x_{22})^2 + \dots + (x_{n1} - x_{n2})^2$
- Pattern Similarity Matrix: $\sum x_i y_i / \sqrt{\sum x_i^2 \, \sum y_i^2}$
- Manhattan Distance Matrix
- Mahalanobis Distance Matrix: Distances that take into account the correlations in your data
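A minimal sketch of several of the continuous-data measures listed above, using only the Python standard library. The function names and the two example cases are hypothetical, and the z-transform uses the sample standard deviation (n - 1 denominator), which is an assumption because the notes do not say which form is intended. The correlation and Mahalanobis measures are left out; the latter needs the inverse of the covariance matrix.

```python
import math
import statistics

def z_transform(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # assumption: sample SD (n - 1 denominator)
    return [(x - mean) / sd for x in column]

def squared_euclidean(u, v):
    """Sum of squared differences across all variables."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def euclidean(u, v):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(u, v))

def manhattan(u, v):
    """Sum of absolute differences across all variables."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pattern_similarity(u, v):
    """sum(x*y) / sqrt(sum(x^2) * sum(y^2)), i.e. the cosine of the angle."""
    numerator = sum(a * b for a, b in zip(u, v))
    denominator = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return numerator / denominator

# Two hypothetical cases described by two variables each.
case_1 = [0.0, 3.0]
case_2 = [4.0, 0.0]

print(squared_euclidean(case_1, case_2))   # 25.0
print(euclidean(case_1, case_2))           # 5.0
print(manhattan(case_1, case_2))           # 7.0
print(pattern_similarity(case_1, case_2))  # 0.0
```

In practice each column would be passed through `z_transform` before the distances are computed, so that variables measured on large scales do not dominate the result.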
- Ordinal Data: convert to ranks, scale to [0, 1], then treat as continuous
- Ratio Data: apply a logarithmic transformation y = log(x), convert to ranks, map to [0, 1], then treat as continuous
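The rank-and-rescale steps for ordinal and ratio data could be sketched as follows. The helper names, the average-rank treatment of ties, and the mapping (r - 1) / (n - 1) onto [0, 1] are assumptions made for this illustration; the notes do not specify them. The abundance values are invented.

```python
import math

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1  # mean of the 1-based ranks in the run
        for k in range(pos, end + 1):
            result[order[k]] = average
        pos = end + 1
    return result

def scale_unit(rank_values):
    """Map ranks onto [0, 1] via (r - 1) / (n - 1); needs at least 2 values."""
    n = len(rank_values)
    return [(r - 1) / (n - 1) for r in rank_values]

# Ratio data (hypothetical abundances): log-transform, rank, then rescale.
abundances = [100.0, 1.0, 1000.0, 10.0]
logged = [math.log(x) for x in abundances]
scaled = scale_unit(ranks(logged))
print(ranks(logged))  # [3.0, 1.0, 4.0, 2.0]

# Ordinal data (hypothetical scores) skip the log step.
scores = [1, 3, 2, 1]
print(ranks(scores))  # [1.5, 4.0, 3.0, 1.5]
```

The rescaled values can then be fed into any of the continuous-data measures.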
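- Binary Data: examine joint presence or absence. Both outcomes may be of equal value for calculating a distance (e.g., male vs. female) or they may be different (e.g., presence or absence of a disease). In the first case you may want to treat both kinds of match the same; in the second you may place more value on matched presence than on matched absence. In the formulas below, for a pair of items, a is the number of characteristics present in both, b and c are the numbers present in only one item or the other, and d is the number absent in both.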
- Simple Matching Matrix: equal weight for matches and non-matches with joint absences included SimpleMatch = (a+d) / (a+b+c+d)
- Russell & Rao Matrix: equal weight for matches and non-matches with joint absences included in denominator only. RussellRao = a / (a+b+c+d)
- Sokal & Sneath 1: double weight for matches with joint absences included. SS1 = 2(a+d) / (2(a+d)+b+c)
- Rogers & Tanimoto: double weight for non-matches with joint absences included. Rogers = (a+d)/ (a+d+2(b+c))
- Jaccard's Similarity Matrix (i.e., similarity ratio): equal weight for matches and non-matches with joint absences excluded. Jaccard = a / (a+b+c)
- Dice Matrix: double weight for matches with joint absences excluded. Dice = 2a / (2a+b+c)
- Fourfold Point Correlation Matrix: this is the binary form of the product moment correlation coefficient. Phi = (a*d - b*c) / SQRT((a+b)(a+c)(b+d)(c+d))
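The binary coefficients listed above can all be computed from the same four counts. The sketch below assumes the usual convention that a counts joint presences, b and c count characteristics present in only the first or only the second item, and d counts joint absences. The function names and the two presence/absence vectors are hypothetical.

```python
import math

def contingency(x, y):
    """Count a, b, c, d for two equal-length sequences of 0/1 values."""
    pairs = list(zip(x, y))
    a = sum(1 for p, q in pairs if p == 1 and q == 1)  # present in both
    b = sum(1 for p, q in pairs if p == 1 and q == 0)  # first item only
    c = sum(1 for p, q in pairs if p == 0 and q == 1)  # second item only
    d = sum(1 for p, q in pairs if p == 0 and q == 0)  # absent in both
    return a, b, c, d

def binary_coefficients(a, b, c, d):
    """Similarity coefficients as given in the list above."""
    n = a + b + c + d
    return {
        "simple_matching": (a + d) / n,
        "russell_rao": a / n,
        "sokal_sneath_1": 2 * (a + d) / (2 * (a + d) + b + c),
        "rogers_tanimoto": (a + d) / (a + d + 2 * (b + c)),
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "phi": (a * d - b * c)
        / math.sqrt((a + b) * (a + c) * (b + d) * (c + d)),
    }

# Hypothetical presence (1) / absence (0) of 8 species at two sites.
site_1 = [1, 1, 1, 0, 0, 0, 1, 0]
site_2 = [1, 1, 0, 1, 0, 0, 0, 0]

counts = contingency(site_1, site_2)
print(counts)  # (2, 2, 1, 3)

result = binary_coefficients(*counts)
print(result["simple_matching"])  # 0.625
print(result["jaccard"])          # 0.4
```

Some denominators become zero for degenerate inputs (for example, Jaccard when a, b and c are all zero), so real data may need a guard that this sketch omits.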
- Various Types of Data, e.g. strings, gene sequences, orders of events
- Hamming Distance: for two sequences of equal length, this is the number of positions at which they differ, i.e., the minimum number of substitutions required to change one item into the other (e.g., mutations in a gene sequence)
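A short sketch of the Hamming distance for two equal-length sequences. The function name and the example sequences are invented for illustration.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(1 for x, y in zip(s, t) if x != y)

# Two hypothetical 7-base sequences that differ at two positions.
print(hamming("GATTACA", "GACTATA"))  # 2
```

Because only substitutions are counted, sequences of different lengths (insertions or deletions) need a different measure.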
last modified: 3/26/14
This material is copyrighted and MAY NOT be used for commercial purposes, © 2001-2019 lobsterman.