Lectures for Advanced Statistics

Cluster Analysis

In R, import the data file "BodyMeasures.txt" and decide how many distinct clusters seem to be present in the data. Standardize the data, then plot the within-group sum of squares (SSwithin) for different numbers of clusters. As with a scree plot, look for the number of clusters at which SSwithin levels off. Based on this, five clusters looks like a useful number to extract.

> bodyMeasures <- read.table("/BodyMeasures.txt", header=TRUE, sep=",", na.strings="NA", dec=".")
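> bodyMeasures2 <- bodyMeasures[2:12] # keep columns 2 to 12
> zbodyMeasures2 <- scale(bodyMeasures2) # standardize each column to mean 0, sd 1
> wss <- (nrow(zbodyMeasures2)-1)*sum(apply(zbodyMeasures2,2,var)) # SSwithin for a single cluster
> for (i in 2:15) wss[i] <- sum(kmeans(zbodyMeasures2,centers=i)$withinss) # SSwithin for 2 to 15 clusters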
> plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
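The same elbow plot can be sketched in a reproducible form. This is a minimal sketch, not the lecture's own code: it assumes the built-in iris data as a stand-in for BodyMeasures.txt, and the names z, maxK and the seed value are illustrative. Because kmeans() starts from random centers, the curve can change from run to run; fixing the seed and using several random starts (nstart) makes it more stable.

```r
# Stand-in data (assumption): the four numeric columns of the built-in iris data.
z <- scale(iris[, 1:4])

# After scale(), every column has mean 0 and standard deviation 1.
round(colMeans(z), 10)
apply(z, 2, sd)

set.seed(42)            # k-means starts from random centers, so fix the seed
maxK <- 15              # largest number of clusters to try (illustrative)
wss <- numeric(maxK)
for (k in 1:maxK) {
  # nstart = 25 keeps the best of 25 random starts for each k
  wss[k] <- kmeans(z, centers = k, nstart = 25)$tot.withinss
}

# Look for the k at which the curve levels off.
plot(1:maxK, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

For standardized data, the one-cluster value wss[1] equals (n - 1) times the number of variables, which is the quantity the first wss line above computes directly.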

Use a k-means clustering procedure, then get the cluster means and append the cluster assignments to the data set.

> fit <- kmeans(zbodyMeasures2, 5) # 5 cluster solution
> aggregate(zbodyMeasures2,by=list(fit$cluster),FUN=mean)
> zbodyMeasures2 <- data.frame(zbodyMeasures2, fit$cluster)
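As a cross-check on the step above, the group means returned by aggregate() should match the centers that kmeans() itself reports. The following is a minimal sketch under stated assumptions: the built-in iris data stands in for the body measurements, and the names z, km, clusterMeans and zWithCluster are illustrative only.

```r
# Stand-in data (assumption): standardized numeric columns of iris.
z <- scale(iris[, 1:4])

set.seed(42)                                   # reproducible cluster labels
km <- kmeans(z, centers = 5, nstart = 25)      # 5 cluster solution

km$size                                        # observations per cluster

# Mean of every variable within each cluster
clusterMeans <- aggregate(as.data.frame(z),
                          by = list(cluster = km$cluster), FUN = mean)
clusterMeans

# The same values are stored in km$centers
all.equal(unname(as.matrix(clusterMeans[, -1])), unname(km$centers))

# Store the assignments in a separate data frame so that z stays purely numeric
zWithCluster <- data.frame(z, cluster = km$cluster)
head(zWithCluster)
```

Keeping the cluster labels in a separate object is a design choice: later steps that need only the measurements, such as a distance matrix, can then use z without first dropping the label column.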

Create a hierarchical agglomerative clustering using Ward's method. First calculate a Euclidean distance matrix, then obtain the clusters and list the number of items in each cluster.

> distMat <- dist(zbodyMeasures2[, 1:11], method = "euclidean") # measurement columns only, not the appended cluster labels
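> fit <- hclust(distMat, method="ward.D2") # "ward" was renamed "ward.D" in R 3.1.0; "ward.D2" applies Ward's criterion to Euclidean distances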
> plot(fit)
> groups <- cutree(fit, k=5)
> table(groups)
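The hierarchical solution can also be compared with the k-means solution by cross-tabulating the two sets of cluster labels. This is a minimal sketch under stated assumptions: the built-in iris data stands in for the body measurements, the object names are illustrative, and method = "ward.D2" is used because in current versions of R it is the option that applies Ward's criterion to unsquared Euclidean distances.

```r
# Stand-in data (assumption): standardized numeric columns of iris.
z <- scale(iris[, 1:4])

d  <- dist(z, method = "euclidean")       # Euclidean distance matrix
hc <- hclust(d, method = "ward.D2")       # Ward's method

plot(hc, labels = FALSE)                  # dendrogram
rect.hclust(hc, k = 5)                    # outline the 5 cluster solution

groups <- cutree(hc, k = 5)               # cluster label for every observation
table(groups)                             # number of items in each cluster

# Compare with a 5 cluster k-means solution on the same data
set.seed(42)
km <- kmeans(z, centers = 5, nstart = 25)
table(hierarchical = groups, kmeans = km$cluster)
```

Cluster numbers are arbitrary in both methods, so agreement shows up as each row of the cross-table having most of its counts in a single column, not as matching label values.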

last modified: 03/27/14