## Proximity Measures

Proximities are measures in which similarity is expressed in terms of a metric distance function d(i, j). Different distance functions are available for interval-scaled, boolean, categorical, ordinal and ratio variables. If a dataset contains measures on multiple characteristics for multiple individuals, a matrix of proximities can be obtained either across individuals (comparing individuals over their characteristics) or across measures (comparing characteristics over the individuals).

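Once the matrix has been built, examine it directly (e.g., look for each item's nearest neighbour, the item at minimum distance) or use it as input to further methods such as cluster analysis.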
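As a rough illustration of building a proximity matrix and looking for nearest neighbours, the sketch below uses plain Python on a small made-up data set. The data `X` and the helper names (`distance_matrix`, `nearest_neighbours`, `transpose`) are hypothetical and exist only for this example. Euclidean distance is used as one possible choice of d(i, j).

```python
import math

# Hypothetical data: 4 individuals (rows) x 3 characteristics (columns).
X = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 4.0],
    [5.0, 6.0, 7.0],
    [9.0, 1.0, 0.0],
]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(rows, dist=euclidean):
    """Square matrix of pairwise distances between the given rows."""
    return [[dist(u, v) for v in rows] for u in rows]

def nearest_neighbours(matrix):
    """For each item, the index of the closest *other* item."""
    result = []
    for i, row in enumerate(matrix):
        # Skip the diagonal: every item is at distance 0 from itself.
        others = [(d, j) for j, d in enumerate(row) if j != i]
        result.append(min(others)[1])
    return result

def transpose(rows):
    """Swap rows and columns so that characteristics become the items."""
    return [list(col) for col in zip(*rows)]

D_individuals = distance_matrix(X)            # 4 x 4: compares individuals
D_measures = distance_matrix(transpose(X))    # 3 x 3: compares characteristics
nn = nearest_neighbours(D_individuals)
```

Passing `X` compares individuals, while passing `transpose(X)` compares the characteristics themselves. Any other distance function with the same two-argument shape could be supplied through the `dist` parameter.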

#### Uses

• measure how similar (or different) objects are across a set of characteristics

#### Examples

• How similar are different skulls based on morphological characteristics?
• How similar are environmental conditions in different habitats based on the co-occurrence of species?

### How this is done

• Construct a proximity matrix for a set of items. The proximity between two objects is a number indicating how similar (a similarity matrix) or how different (a dissimilarity or distance matrix) the two objects are. Geographic distances between locations on a map are a measure of dissimilarity: a larger value means the locations are further apart.
• Continuous Data: examine co-variation in the variables describing the cases. First z-transform the data so that every column has a mean of zero and a standard deviation of 1.
• Correlation Matrix
• Euclidean Distance
• Squared Euclidean Distance: $D^2 = (x_{11}-x_{12})^2 + (x_{21}-x_{22})^2 + \dots + (x_{n1}-x_{n2})^2$
• Pattern Similarity Matrix: $\dfrac{\sum x_i y_i}{\sqrt{\sum x_i^2 \, \sum y_i^2}}$
• Manhattan Distance Matrix
• Mahalanobis Distance Matrix: Distances that take into account the correlations in your data
• Ordinal Data: convert to ranks, scale to [0 .. 1], then treat as continuous
• Ratio Data: apply the logarithmic transformation y = log(x), convert to ranks, map to [0 .. 1], then treat as continuous
• Binary Data: examine joint presence or absence. Both outcomes may carry equal value for calculating a distance (e.g., male vs. female) or they may differ in value (e.g., presence or absence of a disease). In the first case you may want to treat both kinds of match the same; in the second you may place more value on matched presence than on matched absence.

| | Case 2: present | Case 2: absent |
|---|---|---|
| Case 1: present | a | b |
| Case 1: absent | c | d |

• Simple Matching Matrix: equal weight for matches and non-matches with joint absences included. SimpleMatch = (a+d) / (a+b+c+d)
• Russell & Rao Matrix: equal weight for matches and non-matches with joint absences included in the denominator only. RussellRao = a / (a+b+c+d)
• Sokal & Sneath 1: double weight for matches with joint absences included. SS1 = 2(a+d) / (2(a+d)+b+c)
• Rogers & Tanimoto: double weight for non-matches with joint absences included. Rogers = (a+d) / (a+d+2(b+c))
• Jaccard's Similarity Matrix (i.e., similarity ratio): equal weight for matches and non-matches with joint absences excluded. Jaccard = a / (a+b+c)
• Dice Matrix: double weight for matches with joint absences excluded. Dice = 2a / (2a+b+c)
• Fourfold Point Correlation Matrix: this is the binary form of the product moment correlation coefficient. Phi = (a*d - b*c) / SQRT((a+b)(a+c)(b+d)(c+d))
• Various Types of Data, e.g. strings, gene sequences, orders of events
• Hamming Distance: the number of positions at which two equal-length items differ, i.e., the minimum number of substitutions required to change one item into the other (e.g., mutations in a gene sequence)
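
Several of the measures listed above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation. The function names are invented for this example, the z-transform assumes the population standard deviation (the text does not say which one is meant), and there are no guards for degenerate cases such as a zero denominator. Mahalanobis distance is left out because it requires the inverse covariance matrix of the data.

```python
import math
import statistics

def z_transform(column):
    """Rescale a column to mean 0 and standard deviation 1.
    Assumption: population SD (pstdev); the sample SD is an equally valid choice."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [(x - mean) / sd for x in column]

def squared_euclidean(x, y):
    """D^2 = sum of squared differences between paired values."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """Sum of absolute differences between paired values."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pattern_similarity(x, y):
    """Sum(x_i * y_i) / sqrt(Sum(x_i^2) * Sum(y_i^2))."""
    num = sum(a * b for a, b in zip(x, y))
    return num / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def binary_counts(case1, case2):
    """Cells (a, b, c, d) of the 2x2 table for two 0/1 vectors."""
    pairs = list(zip(case1, case2))
    a = sum(1 for p, q in pairs if p and q)          # present in both
    b = sum(1 for p, q in pairs if p and not q)      # present in case 1 only
    c = sum(1 for p, q in pairs if not p and q)      # present in case 2 only
    d = sum(1 for p, q in pairs if not p and not q)  # absent in both
    return a, b, c, d

def binary_similarities(case1, case2):
    """The binary coefficients from the list, keyed by (hypothetical) name."""
    a, b, c, d = binary_counts(case1, case2)
    n = a + b + c + d
    return {
        "simple_matching": (a + d) / n,
        "russell_rao": a / n,
        "sokal_sneath_1": 2 * (a + d) / (2 * (a + d) + b + c),
        "rogers_tanimoto": (a + d) / (a + d + 2 * (b + c)),
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "phi": (a * d - b * c)
               / math.sqrt((a + b) * (a + c) * (b + d) * (c + d)),
    }

def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for p, q in zip(s, t) if p != q)

# Made-up presence/absence data for two cases over eight characteristics.
case1 = [1, 1, 1, 0, 0, 0, 1, 0]
case2 = [1, 1, 0, 0, 0, 1, 1, 0]
sims = binary_similarities(case1, case2)
```

For the two made-up cases the table cells are a = 3, b = 1, c = 1, d = 3, so simple matching gives 6/8 = 0.75 while Jaccard gives 3/5 = 0.6. The gap comes entirely from whether the three joint absences are counted as agreement.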