Large-scale pharmacogenomic studies, like the CCLE and GDSC we are investigating, measure the effects of many anti-cancer drugs on various cell lines. However, the primary interest and motiviation behind these studies is not to simply determine which cell lines are most susceptible to which drugs, but rather what are the characteristics of the cell lines that are susceptible to certain drugs? The hope is that by matching these characteristics to those of patients’ tumor cells, this information can be used to create improved cancer treatment regimens for patients.

If the hope is to create improved treatment regimens in cancer patients, why would we study cell lines? We have to start in cell lines (looking at the in vitro response) when screening many drugs at multiple concentrations, since we need to be able to compare the effects of drugs and doses on cells of the same type. Cell lines can be grown continously (even in different labs) to generate comparable samples, whereas clinical patient samples are fixed amounts of cells that would be used up after the experiment.

So what are the characteristics to examine when attempting to relate the best treatment regimen with a cell line or patient? These consist of genomic factors (e.g. gene expression, copy number, mutation) as well as pharmacologic (e.g. AUC or IC50 of response to various drugs). It is also of interest to investigate whether similar drugs (similar could mean similar mechanism of action) have similar effects.

Figure from “Tumor-Derived Cell Lines as Molecular Models of Cancer Pharmacogenomics” by Andrew Goodspeed et al. (Molecular Cancer Research, 2016).

The CCLE and GDSC studies both investigated genomic profiles for all the cell lines that were used to study drug response. The genomic profiles collected included gene expression measurements for more than 10,000 genes in every cell line (measured using microarrays) and mutation status for a subset of 64 genes in every cell line (measured using targeted sequencing). We will not have time to explore this data, but it is available through the R package PharmacoGx if you wish to explore it for a future project. Some code is also included in the downloadData.R script to extract and clean up some of this data (gene expression measurements) to help you get started.

One of the main finding of Haibe-Kains et al. (2013) (the reanalysis paper) was that the genomic data of the cell lines used in both studies were highly correlated. However, based on our analysis so far, this doesn’t seem to be the case with the pharmacological data. Part of this may be explained by differences in protocol (e.g. different drug concentrations) between the studies, but it may also be due to our current approach to measuring “agreement” between the studies. In the rest of this tutorial, we’ll investigate how classifying the cell lines and drugs into subgroups might help explain some of this lack (or not!) of agreement.

Setup Workspace

We start by loading the tidyverse family of packages and setting the default ggplot2 theme to “theme_bw”.

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.2     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Load Summarized Dataset

First, we’ll read in the RDS file that contains the summarized pharmacological data (including the IC50 and AUC values for each drug and cell line combination, as described above).

summarizedData <- readRDS(file.path("..", "data", "summarizedPharmacoData.rds"))
## 'data.frame':    2557 obs. of  6 variables:
##  $ cellLine : chr  "22RV1" "5637" "639-V" "697" ...
##  $ drug     : chr  "Nilotinib" "Nilotinib" "Nilotinib" "Nilotinib" ...
##  $ ic50_CCLE: num  8 7.48 8 1.91 8 ...
##  $ auc_CCLE : num  0 0.00726 0.07101 0.15734 0 ...
##  $ ic50_GDSC: num  155.27 219.93 92.18 3.06 19.63 ...
##  $ auc_GDSC : num  0.00394 0.00362 0.00762 0.06927 0.02876 ...

Resistance of Cell Lines

Something that was not considered in Haibe-Kains et al. (2013) was that different cell lines may more or less resistant (resistance means that the cell line does not respond to treatment by a drug) to drug treatment than others. Let’s explore first why that might be an important factor, and then directly compare cell line sensitivity in the two studies.

Consider the following situation: say we have a set of cell lines that are resistant to a particular drug (they don’t respond to it). Then we would expect that no matter what dose we give to the cell lines, their viability results would not change (they would stay near 100%). In this case, the AUC (area above the response curve) value would be near zero since the dose-response curve would look flat:

How would you calculate the IC50 value in this case?

Place your answer here

Assuming the results in both studies were consistent, then a scatterplot of the AUC values in this case would look like this (allowing for experimental error - we don’t expect to get exactly 100% viability at each dose due to variations in the experimental conditions and cellular growth rates):

AUC_study1 <- rbeta(200, 1, 5)
AUC_study2 <- rbeta(200, 1, 5)
resistant <- data.frame(AUC_study1, AUC_study2)

ggplot(resistant, aes(y = AUC_study2, x = AUC_study1)) +
    geom_point() +
    xlim(0, 1) +
    ylim(0, 1) +
    ggtitle("Simulated AUC of resistant cell lines")