Principal Components Analysis (PCA) is often used for dimensionality reduction when dealing with high-dimensional data. Here, we will look at a subset of the drug sensitivity data from the GDSC study. Each cell line will be treated as an independent observation, with one feature per drug: the AUC value for the interaction between that cell line and that drug.
To start, let’s load the data from the GDSC study and format it into a data.frame of AUC values.
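library(tidyr)  # provides spread()

# Load the summarized pharmacological data
sumData <- readRDS(file.path("..", "data", "summarizedPharmacoData.rds"))
# Reshape to wide format: one row per cell line, one column per drug
auc_GDSC <- spread(sumData[, c(1, 2, 6)], drug, auc_GDSC)
rownames(auc_GDSC) <- auc_GDSC$cellLine
auc_GDSC$cellLine <- NULL
# Keep only cell lines with an AUC value for every drug
auc_GDSC <- auc_GDSC[!is.na(rowSums(auc_GDSC)), ]
head(auc_GDSC)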
##          17-AAG  AZD0530  AZD6244 Crizotinib Erlotinib lapatinib Nilotinib
## 697    0.064947 0.045913 0.006099   0.057744  0.077949  0.012493  0.069265
## A253   0.234611 0.023700 0.002219   0.004421  0.173074  0.189943  0.002092
## BL-41  0.062905 0.006885 0.001980   0.054177  0.075726  0.017029  0.010249
## BT-474 0.254544 0.025552 0.007333   0.002215  0.013753  0.371971  0.002457
## C2BBe1 0.280216 0.139848 0.314262   0.005379  0.060791  0.013057  0.030435
## CAS-1  0.085450 0.007279 0.004754   0.012538  0.007571  0.007271  0.003844
##        Nutlin-3 paclitaxel PD-0325901 PD-0332991 PHA-665752  PLX4720
## 697    0.237522   0.655135   0.049681   0.492333   0.026027 0.196415
## A253   0.003647   0.319748   0.117657   0.046008   0.006456 0.003951
## BL-41  0.001980   0.396726   0.002178   0.141175   0.010968 0.009480
## BT-474 0.003816   0.108941   0.012026   0.009203   0.003945 0.002464
## C2BBe1 0.124189   0.086339   0.352049   0.068628   0.003077 0.016842
## CAS-1  0.003776   0.122153   0.017412   0.035815   0.003643 0.011026
##        Sorafenib   TAE684
## 697     0.132290 0.260410
## A253    0.032823 0.023700
## BL-41   0.012154 0.139509
## BT-474  0.004357 0.006060
## C2BBe1  0.032868 0.165815
## CAS-1   0.010380 0.136932
As we can see, we now have a data.frame of AUC values with rows corresponding to cell lines and columns corresponding to drugs.
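To make the reshaping and filtering steps concrete, here is a minimal sketch on a made-up long-format table. The data, the object names (toy_long, toy_wide, toy_complete), and the use of base R's tapply as a stand-in for spread are all assumptions for illustration; none of them come from the GDSC data.

```r
# Toy long-format data (made up): one row per cell line / drug pair.
# Cell line "C" has no measurement for drug "d2".
toy_long <- data.frame(
  cellLine = c("A", "A", "B", "B", "C"),
  drug     = c("d1", "d2", "d1", "d2", "d1"),
  auc      = c(0.10, 0.20, 0.30, 0.40, 0.50)
)

# Pivot to wide format with base R: rows are cell lines, columns are drugs.
# Cell line / drug combinations with no data become NA.
toy_wide <- with(toy_long, tapply(auc, list(cellLine, drug), mean))

# rowSums() is NA for any row containing an NA,
# so this keeps only the rows with a value for every drug.
toy_complete <- toy_wide[!is.na(rowSums(toy_wide)), , drop = FALSE]

print(toy_wide)
print(toy_complete)
```

In this toy example, cell line C is dropped because it lacks a value for d2. The same filter applied to auc_GDSC removes every cell line that is missing a value for any drug, which matters because prcomp cannot handle missing values in a matrix or data.frame input.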
We are now ready to perform PCA via the prcomp function. This function computes all of the relevant information and returns it as a list, which we will name pca. Note: this function expects the independent observations to be in the rows of the input.
pca <- prcomp(auc_GDSC)
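As a quick check on what prcomp returns, the sketch below runs it on a small simulated matrix. The data and the names toy, toy_pca, and manual_scores are made up for illustration. It relies only on documented behavior: by default prcomp centers each column but does not rescale it, and the scores stored in x are the centered data multiplied by the rotation matrix.

```r
set.seed(1)
# Simulated data: 20 observations (rows) by 4 features (columns).
toy <- matrix(rnorm(20 * 4), nrow = 20,
              dimnames = list(NULL, paste0("drug", 1:4)))

toy_pca <- prcomp(toy)  # defaults: center = TRUE, scale. = FALSE

# Components of the returned list.
print(names(toy_pca))   # sdev, rotation, center, scale, x

# One row of scores per observation, one column per PC.
print(dim(toy_pca$x))

# The scores are the column-centered data projected onto the loadings.
centered <- sweep(toy, 2, toy_pca$center)
manual_scores <- centered %*% toy_pca$rotation
print(all.equal(manual_scores, toy_pca$x, check.attributes = FALSE))
```

If the features were on very different scales, passing scale. = TRUE would standardize each column before the decomposition. Whether that is appropriate for AUC values is a judgment call that the analysis above leaves at the default.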
Now that the calculation is done, we can start to visualize the results. A reasonable starting point is to plot the first two principal components (PCs) and see what structure (if any) we can observe in the data. When given a matrix, plot uses its first two columns, which here are PC1 and PC2.
plot(pca$x, asp = 1)
For a more detailed picture, we may want to look at a larger number of PCs. We can use the pairs function to plot larger numbers of dimensions against each other.
pairs(pca$x[,1:5], asp = 1)
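When deciding how many PCs are worth plotting, it can help to look at how much of the total variance each one explains. The sketch below computes this from the sdev element on simulated data; toy, toy_pca, pc_var, and prop_var are made-up names, and the same calculation should carry over to the pca object above.

```r
set.seed(1)
# Simulated data (made up) with one dominant direction of variation.
toy <- matrix(rnorm(50 * 5), nrow = 50)
toy[, 1] <- toy[, 1] * 5

toy_pca <- prcomp(toy)

# The variance of each PC is the square of its standard deviation.
pc_var <- toy_pca$sdev^2
prop_var <- pc_var / sum(pc_var)

# Proportion and cumulative proportion of variance explained.
print(round(prop_var, 3))
print(round(cumsum(prop_var), 3))

# summary() reports the same quantities in its importance table.
print(summary(toy_pca)$importance)
```

If the cumulative proportion levels off after a few components, the later panels of the pairs plot are unlikely to show much additional structure.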