Supplement: Exploring High Dimensional Data with PCA and Clustering

Principal Components Analysis

Principal Components Analysis (PCA) is often used for dimensionality reduction when dealing with high-dimensional data. Here, we will look at a subset of the drug sensitivity data, from the GDSC study. Each cell line will be treated as an independent observation with multiple features corresponding to the AUC values for the interaction between the cell line and each drug in the study.

To start, let’s load the data from the GDSC study and format it into a data.frame of AUC values.

sumData <- readRDS(file.path("..", "data", "summarizedPharmacoData.rds"))
auc_GDSC <- spread(sumData[,c(1,2,6)], drug, auc_GDSC)
rownames(auc_GDSC) <- auc_GDSC$cellLine
auc_GDSC$cellLine <- NULL
auc_GDSC <- auc_GDSC[!is.na(rowSums(auc_GDSC)), ]
head(auc_GDSC)

##          17-AAG  AZD0530  AZD6244 Crizotinib Erlotinib lapatinib Nilotinib
## 697    0.064947 0.045913 0.006099   0.057744  0.077949  0.012493  0.069265
## A253   0.234611 0.023700 0.002219   0.004421  0.173074  0.189943  0.002092
## BL-41  0.062905 0.006885 0.001980   0.054177  0.075726  0.017029  0.010249
## BT-474 0.254544 0.025552 0.007333   0.002215  0.013753  0.371971  0.002457
## C2BBe1 0.280216 0.139848 0.314262   0.005379  0.060791  0.013057  0.030435
## CAS-1  0.085450 0.007279 0.004754   0.012538  0.007571  0.007271  0.003844
##        Nutlin-3 paclitaxel PD-0325901 PD-0332991 PHA-665752  PLX4720
## 697    0.237522   0.655135   0.049681   0.492333   0.026027 0.196415
## A253   0.003647   0.319748   0.117657   0.046008   0.006456 0.003951
## BL-41  0.001980   0.396726   0.002178   0.141175   0.010968 0.009480
## BT-474 0.003816   0.108941   0.012026   0.009203   0.003945 0.002464
## C2BBe1 0.124189   0.086339   0.352049   0.068628   0.003077 0.016842
## CAS-1  0.003776   0.122153   0.017412   0.035815   0.003643 0.011026
##        Sorafenib   TAE684
## 697     0.132290 0.260410
## A253    0.032823 0.023700
## BL-41   0.012154 0.139509
## BT-474  0.004357 0.006060
## C2BBe1  0.032868 0.165815
## CAS-1   0.010380 0.136932

As we can see, we now have a data.frame of AUC values with rows corresponding to cell lines and columns corresponding to drugs.

We are now ready to perform PCA, via the prcomp function. This function computes all of the relevant information and returns it as a list, which we will name pca. Note: this function expects the independent observations to be in the rows of the input.

pca <- prcomp(auc_GDSC)

Now that the calculation is done, we can start to visualize the results. A reasonable starting point is to plot the top 2 principal components (PCs) and see what structure (if any) we can observe in the data.

plot(pca$x, asp = 1)

For a more detailed picture, we may want to look at a larger number of PCs. We can use the pairs function to plot larger numbers of dimensions against each other.

pairs(pca$x[,1:5], asp = 1)

Another important consideration is the relative importance of the PCs. Calling the plot function directly on the output of prcomp produces a barplot of the variances of the PCs, which is one way of quantifying the amount of information they contain. Notice that the height of the bars is strictly decreasing; this will always be the case for PCA.

plot(pca)

Since the first PC seems to contain a fair amount of information, we may also be interested in examining which of the drugs are most related to it. We can do this by plotting the loadings of each drug on PC-1, which are stored in the rotation matrix.

barplot(pca$rotation[,1], las = 2)

Clustering

Another question we may be interested in answering is whether or not there are distinct groups of cell lines, perhaps based on cancer type or tissue of origin. With no additional information on the cell lines, we can seek to discover such groups via clustering.

k-means

Below, we apply k-means clustering to the same subset of the data as above.

km <- kmeans(auc_GDSC, centers = 3)

And we visualize the resulting clusters on the top two PCs:

plot(pca$x, asp = 1, col = km$cluster)

Alternatively, we could have applied our clustering algorithm to a reduced representation of the data. Here, we will apply k-means to the top 3 PCs of the data and again visualize the results using the top 2 PCs.

km2 <- kmeans(pca$x[,1:3], centers = 3)
plot(pca$x, asp = 1, col = km2$cluster)

Why do these results differ from the clusters obtained on the full dataset? (Hint: look at the pairs plot).

Hierarchical clustering

We will try one more clustering method, this time using the hclust function to perform hierarchical clustering.

hc <- hclust(dist(auc_GDSC))
lab <- cutree(hc, k = 3)

plot(pca$x, asp = 1, col = lab)

As with k-means, we may also choose to cluster the low-dimensional representation of our data rather than the full dataset:

hc <- hclust(dist(pca$x[,1:3]))
lab <- cutree(hc, k = 3)

plot(pca$x, asp = 1, col = lab)