Advanced Statistics - Biology 6030

Bowling Green State University, Fall 2019

Cluster Analysis

Cluster analysis is a set of methods for grouping objects into categories based on their similarity. As a descriptive technique, it discovers structure in data without requiring an explanation of why that structure exists.



How this is done

In R, import the data file "BodyMeasures.txt" and decide how many distinct clusters seem to be present in your data. To do so, standardize the data, then plot the within-groups sum of squares for different numbers of clusters. As with a scree plot, look for the number of clusters at which the within-groups sum of squares levels out. Because kmeans starts from randomly chosen centers, the curve can differ slightly from run to run. Based on this plot, five clusters looks like a useful number to extract.

> bodyMeasures <- read.table("/BodyMeasures.txt", header=TRUE, sep=",", na.strings="NA", dec=".")
> bodyMeasures2 <- bodyMeasures[2:12] # keep columns 2 to 12
> zbodyMeasures2 <- scale(bodyMeasures2) # standardize each variable
> wss <- (nrow(zbodyMeasures2)-1)*sum(apply(zbodyMeasures2,2,var)) # one cluster: total sum of squares
> for (i in 2:15) wss[i] <- sum(kmeans(zbodyMeasures2,centers=i)$withinss) # 2 to 15 clusters
> plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
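The same elbow computation can be sketched on a data set that ships with R, in case BodyMeasures.txt is not at hand. This is only an illustration under assumptions: the built-in mtcars data stands in for the body measures, set.seed() and nstart=25 are added so that the random starts of kmeans() give repeatable and more stable results, and the helper name wssFor is made up for this example.

```r
# Stand-in data (assumption): mtcars replaces BodyMeasures.txt
zdata <- scale(mtcars)

# Hypothetical helper: total within-cluster sum of squares for k clusters
wssFor <- function(k, x) {
  # With one cluster the within sum of squares is the total sum of squares
  if (k == 1) return((nrow(x) - 1) * sum(apply(x, 2, var)))
  kmeans(x, centers = k, nstart = 25)$tot.withinss
}

set.seed(6030)  # fix the random starts so the curve is repeatable
wss <- sapply(1:10, wssFor, x = zdata)

plot(1:10, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

Using nstart=25 makes kmeans() keep the best of 25 random starts, which tends to smooth out bumps in the curve that come from a single poor start.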

Run a k-means clustering procedure, then get the cluster means and append the cluster assignments to the data set.

> fit <- kmeans(zbodyMeasures2, 5) # 5 cluster solution
> aggregate(zbodyMeasures2,by=list(fit$cluster),FUN=mean)
> zbodyMeasures2 <- data.frame(zbodyMeasures2, fit$cluster)
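Cluster means of standardized variables are in standard deviation units, so it can help to look at the means in the original units as well, together with the cluster sizes. The sketch below is an assumption-laden illustration, not part of the course procedure: mtcars stands in for the body measures, three clusters are chosen arbitrarily, and the names kmFit, sizes, and rawMeans are invented here.

```r
# Stand-in data (assumption): mtcars replaces BodyMeasures.txt
zdata <- scale(mtcars)

set.seed(6030)  # repeatable random starts
kmFit <- kmeans(zdata, centers = 3, nstart = 25)

# Number of observations in each cluster
sizes <- table(kmFit$cluster)
print(sizes)

# Cluster means of the unstandardized variables
rawMeans <- aggregate(mtcars, by = list(cluster = kmFit$cluster), FUN = mean)
print(rawMeans)
```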

Create a hierarchical agglomerative clustering using Ward's method. First calculate a Euclidean distance matrix, then obtain the clusters and list the number of items in each cluster.

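> distMat <- dist(zbodyMeasures2[, 1:11], method = "euclidean") # leave out the appended fit.cluster column
> fit <- hclust(distMat, method="ward.D2") # "ward" is called "ward.D" since R 3.1.0; "ward.D2" is Ward's criterion for unsquared distances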
> plot(fit)
> groups <- cutree(fit, k=5)
> table(groups)
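One way to see how far the two procedures agree is to cross-tabulate the k-means assignments against the groups cut from the tree. The sketch below is an illustration only: it uses the built-in mtcars data as a stand-in for the body measures, picks three clusters arbitrarily, and uses the names kmFit, hcFit, hcGroups, and agreement, none of which come from the course material. Cluster numbers are arbitrary labels, so agreement shows up as each row having one dominant column, not as counts on the diagonal.

```r
# Stand-in data (assumption): mtcars replaces BodyMeasures.txt
zdata <- scale(mtcars)

# k-means solution with repeatable random starts
set.seed(6030)
kmFit <- kmeans(zdata, centers = 3, nstart = 25)

# Ward's method on Euclidean distances (requires R 3.1.0 or later)
hcFit <- hclust(dist(zdata, method = "euclidean"), method = "ward.D2")
hcGroups <- cutree(hcFit, k = 3)

# Rows: k-means clusters; columns: Ward groups
agreement <- table(kmeans = kmFit$cluster, ward = hcGroups)
print(agreement)
```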

last modified: 03/27/14
This material is copyrighted and MAY NOT be used for commercial purposes, 2001-2019 lobsterman.