## Cluster Analysis

Cluster analysis is a set of methods for grouping objects into categories based on their similarity. As a descriptive technique, it discovers structure in data without requiring an explanation of why that structure exists.

### Uses

• identify groups of similar cases based on a set of attributes
• classify cases into relatively homogeneous groups

### Example

• cluster multiple individuals of one species from different localities to determine population membership
• include all variables that are believed to be significant for characterizing the cases
• compute a measure of similarity for each pair of cases. Note that the inverse of similarity is distance, so highly similar cases lie close together
• plot clusters as a function of the coefficient of similarity: dendrogram, icicle plot
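
The following is a minimal sketch of the pairwise-distance and merging steps, not a definitive implementation. It assumes squared Euclidean distance as the measure and treats the distance between two clusters as the smallest distance between any of their members (single linkage); both choices, the function names, and the sample measurements are assumptions made for illustration. The recorded merge distances are the heights a dendrogram would display.

```python
from itertools import combinations


def squared_euclidean(a, b):
    """Sum of squared differences across all variables of two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerate(cases, distance=squared_euclidean):
    """Merge the two closest clusters repeatedly until one cluster remains.

    Cluster-to-cluster distance is the smallest case-to-case distance
    (single linkage), an assumption made for this sketch.
    Returns a list of (cluster_a, cluster_b, merge_distance) tuples.
    """
    # Start with every case in its own cluster, identified by case index.
    clusters = [(i,) for i in range(len(cases))]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters that are closest to each other.
        best = None
        for c1, c2 in combinations(clusters, 2):
            d = min(distance(cases[i], cases[j]) for i in c1 for j in c2)
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        # Replace the two clusters with their union and record the merge.
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
        history.append((c1, c2, d))
    return history


# Hypothetical measurements: two variables for each of four individuals.
cases = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 6.0)]
history = agglomerate(cases)
for left, right, height in history:
    print(left, "+", right, "merged at distance", height)
```

With these sample values, individuals 0 and 1 join first, then 2 and 3, and the two pairs merge last at a much larger distance, which is the pattern a dendrogram would show as two distinct groups.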

### How this is done

• standardize variables that are measured in different units
• based on this information, group cases into clusters using one of several clustering algorithms
• terminate the formation of clusters when all objects are included in one big cluster (agglomerative) or have been split into individual cases (divisive)
• A variety of different distance/similarity measures can be used. For the following pair of cases consider advantages and disadvantages of different methods
• Distance Measures
• Euclidean Distance
• Squared Euclidean Distance $D^2 = (x_{11}-x_{12})^2 + (x_{21}-x_{22})^2 + \dots + (x_{n1}-x_{n2})^2$
• standardized squared Euclidean Distance
• Manhattan Distance
• Hamming Distance
• Mahalanobis Distance
• Inner Product Measure using the angle between two vectors
• Clustering algorithms
• hierarchical clustering: agglomerative or divisive
• criteria for combining/dividing clusters: