Principal Components Analysis

Principal Components Analysis (PCA) is often used for dimensionality reduction when dealing with high-dimensional data. Here, we will look at a subset of the drug sensitivity data from the GDSC study. Each cell line will be treated as an independent observation whose features are the AUC values measured for that cell line against each drug in the study.

To start, let's load the data from the GDSC study and format it into a data.frame of AUC values.

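library(tidyr)  # provides spread()
sumData <- readRDS(file.path("..", "data", "summarizedPharmacoData.rds"))
# Reshape to wide format: one row per cell line, one column per drug
auc_GDSC <- spread(sumData[,c(1,2,6)], drug, auc_GDSC)
rownames(auc_GDSC) <- auc_GDSC$cellLine
auc_GDSC$cellLine <- NULL
# Keep only cell lines with an AUC value for every drug
auc_GDSC <- auc_GDSC[!is.na(rowSums(auc_GDSC)), ]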
head(auc_GDSC)
##          17-AAG  AZD0530  AZD6244 Crizotinib Erlotinib lapatinib Nilotinib
## 697    0.064947 0.045913 0.006099   0.057744  0.077949  0.012493  0.069265
## A253   0.234611 0.023700 0.002219   0.004421  0.173074  0.189943  0.002092
## BL-41  0.062905 0.006885 0.001980   0.054177  0.075726  0.017029  0.010249
## BT-474 0.254544 0.025552 0.007333   0.002215  0.013753  0.371971  0.002457
## C2BBe1 0.280216 0.139848 0.314262   0.005379  0.060791  0.013057  0.030435
## CAS-1  0.085450 0.007279 0.004754   0.012538  0.007571  0.007271  0.003844
##        Nutlin-3 paclitaxel PD-0325901 PD-0332991 PHA-665752  PLX4720
## 697    0.237522   0.655135   0.049681   0.492333   0.026027 0.196415
## A253   0.003647   0.319748   0.117657   0.046008   0.006456 0.003951
## BL-41  0.001980   0.396726   0.002178   0.141175   0.010968 0.009480
## BT-474 0.003816   0.108941   0.012026   0.009203   0.003945 0.002464
## C2BBe1 0.124189   0.086339   0.352049   0.068628   0.003077 0.016842
## CAS-1  0.003776   0.122153   0.017412   0.035815   0.003643 0.011026
##        Sorafenib   TAE684
## 697     0.132290 0.260410
## A253    0.032823 0.023700
## BL-41   0.012154 0.139509
## BT-474  0.004357 0.006060
## C2BBe1  0.032868 0.165815
## CAS-1   0.010380 0.136932

As we can see, we now have a data.frame of AUC values with rows corresponding to cell lines and columns corresponding to drugs.
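
To make the reshaping and filtering steps concrete, here is a minimal sketch on a made-up long-format table. The names long, wide and complete and all the values are hypothetical, and base R's tapply stands in for spread so that the example needs no extra packages. It should give the same kind of wide layout, with NA wherever a cell line has no value for a drug.

```r
# Hypothetical long-format data: one row per (cell line, drug) pair.
# Cell line "C" has no measurement for drug "d2".
long <- data.frame(
  cellLine = c("A", "A", "B", "B", "C"),
  drug     = c("d1", "d2", "d1", "d2", "d1"),
  auc      = c(0.1, 0.2, 0.3, 0.4, 0.5),
  stringsAsFactors = FALSE
)

# Long to wide: rows are cell lines, columns are drugs.
# Each cell holds a single value, so mean() simply returns it;
# combinations that never occur become NA.
wide <- as.data.frame(with(long, tapply(auc, list(cellLine, drug), mean)))

# rowSums() is NA for any row containing an NA, so this keeps only
# the cell lines that were measured against every drug.
complete <- wide[!is.na(rowSums(wide)), ]
print(complete)
```

Here cell line C is dropped because it has no value for d2, which mirrors what the filter does to auc_GDSC.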

We are now ready to perform PCA with the prcomp function. This function computes all of the relevant information and returns it as a list (an object of class prcomp), which we will name pca. Note: this function expects the independent observations to be in the rows of the input. By default, it centers each column but does not rescale it.

pca <- prcomp(auc_GDSC)
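
To see what the returned list contains, the following sketch runs prcomp on a small random matrix. The names toy and toy_pca are hypothetical stand-ins for auc_GDSC and pca. It then checks that the scores in x are the centered data multiplied by the rotation (loadings) matrix.

```r
set.seed(1)

# Hypothetical toy data: 20 observations (rows) by 4 features (columns)
toy <- matrix(rnorm(80), nrow = 20,
              dimnames = list(paste0("obs", 1:20), paste0("feat", 1:4)))

toy_pca <- prcomp(toy)

# Elements of the result: sdev, rotation, center, scale, x
print(names(toy_pca))

# sdev:     standard deviation of each PC
# rotation: loadings (one column per PC, one row per feature)
# center:   column means that were subtracted from the data
# x:        PC scores (one row per observation, one column per PC)

# The scores are the centered data projected onto the loadings
centered <- sweep(toy, 2, toy_pca$center)
scores   <- centered %*% toy_pca$rotation
print(all.equal(scores, toy_pca$x, check.attributes = FALSE))
```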

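Now that the calculation is done, we can start to visualize the results. A reasonable starting point is to plot the first two principal components (PCs) and see what structure (if any) we can observe in the data. The PC scores are stored in the matrix pca$x, and calling plot on a matrix plots its first two columns against each other.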

plot(pca$x, asp = 1)
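
When reading such a plot, it can help to know how much of the total variance each axis represents. The sketch below, again on hypothetical toy data (toy, toy_pca and var_explained are illustrative names), derives the proportion of variance from the sdev element and writes it into the axis labels.

```r
set.seed(1)

# Hypothetical toy data: 40 observations by 5 features
toy <- matrix(rnorm(200), nrow = 40)
toy[, 1] <- toy[, 1] * 3   # give one feature a larger spread

toy_pca <- prcomp(toy)

# The variance of each PC is its squared standard deviation;
# dividing by the total gives the proportion of variance explained.
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

plot(toy_pca$x[, 1:2], asp = 1,
     xlab = sprintf("PC1 (%.1f%% of variance)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%% of variance)", 100 * var_explained[2]))
```

The same calculation should carry over to the real data by replacing toy_pca with pca.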

For a more detailed picture, we may want to look at more PCs. The pairs function plots every pair of the selected dimensions against each other. Here we use the first five PCs.

pairs(pca$x[,1:5], asp = 1)
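
The choice of five PCs here is a judgment call. One possible way to guide it, sketched below on hypothetical toy data, is to look at the cumulative proportion of variance and keep enough PCs to pass a chosen threshold. The 90% cutoff and the names toy, toy_pca, cum_var and k are assumptions made for illustration, not part of the original analysis.

```r
set.seed(2)

# Hypothetical toy data: 50 observations by 6 features
toy <- matrix(rnorm(300), nrow = 50)
toy_pca <- prcomp(toy)

# Cumulative proportion of variance explained by the first 1, 2, ... PCs
cum_var <- cumsum(toy_pca$sdev^2) / sum(toy_pca$sdev^2)
print(round(cum_var, 3))

# Smallest number of PCs reaching the (arbitrary) 90% threshold;
# use at least 2 so that there is something to plot pairwise.
k <- max(which(cum_var >= 0.9)[1], 2)
print(k)
```

The resulting k could then be used in place of 5, as in pairs(toy_pca$x[, 1:k], asp = 1).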