Tutorial 0b: Introduction to the Tidyverse

Introduction

Caveat

Before getting started, we would like to point out that learning these packages is not necessary for performing data analysis in R. The R language has existed for 25 years, and while popular, the tidyverse is a relatively new addition to the R ecosystem of packages. Many statisticians, data scientists and other scientists are happy (and highly skilled!) performing data analysis in R without using the tidyverse packages. Most tasks that can be accomplished using packages in the tidyverse can be similarly completed using base R. However, in many cases (in particular, data manipulation), it can be much easier with the tools in the tidyverse. For this and many other reasons (including excellent online documentation and a large user community), we hope that you’ll give the tidyverse a try!

Scope

If you’re still interested and reading, great!

This tutorial will go over some basics about programming in R using packages in the tidyverse. The tidyverse is a family of related R packages developed to streamline data science in R. If you’ve ever used the ggplot2 package to create plots, you’ve already experienced part of the tidyverse! The core tidyverse packages include ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.

Phew! That’s a lot of packages!

Unforunately, we don’t have time to cover all of them. Instead, we’ll give a light introduction to a couple of the packages that will be helpful for working with large datasets:

dplyr : for data manipulation,
tidyr : for “tidy”ing data.

We will assume that you’ve had some exposure to the powerful plotting capabilities of the ggplot2 package through other resources.

Installing the Tidyverse

To get started, we will need to install the tidyverse family of packages.

install.packages("tidyverse")

If the packages are installed without any errors, we can load them as usual.

library(tidyverse)

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.2     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Notice that the core tidyverse packages are listed under Attaching packages and loaded all at once. How wonderful!

To demonstrate the basic usage of these packages, we also import the raw and summarized pharmacological datasets that we’ll be analyzing today.

pharmacoData <- readRDS(file.path("..", "data", "rawPharmacoData.rds"))
str(pharmacoData)

## 'data.frame':    43427 obs. of  6 variables:
##  $ cellLine     : chr  "22RV1" "22RV1" "22RV1" "22RV1" ...
##  $ drug         : chr  "17-AAG" "17-AAG" "17-AAG" "17-AAG" ...
##  $ doseID       : chr  "doses1" "doses2" "doses3" "doses4" ...
##  $ concentration: num  0.0025 0.008 0.025 0.08 0.25 0.8 2.53 8 0.0025 0.008 ...
##  $ viability    : num  94.1 86 99.9 85 62 ...
##  $ study        : chr  "CCLE" "CCLE" "CCLE" "CCLE" ...

summarizedData <- readRDS(file.path("..", "data", "summarizedPharmacoData.rds"))
str(summarizedData)

## 'data.frame':    2557 obs. of  6 variables:
##  $ cellLine : chr  "22RV1" "5637" "639-V" "697" ...
##  $ drug     : chr  "Nilotinib" "Nilotinib" "Nilotinib" "Nilotinib" ...
##  $ ic50_CCLE: num  8 7.48 8 1.91 8 ...
##  $ auc_CCLE : num  0 0.00726 0.07101 0.15734 0 ...
##  $ ic50_GDSC: num  155.27 219.93 92.18 3.06 19.63 ...
##  $ auc_GDSC : num  0.00394 0.00362 0.00762 0.06927 0.02876 ...

The Pipe `%>%`

The “pipe” symbol (%>%) is a commonly used feature of the tidyverse. It was originally defined in the (cleverly named) magrittr package, but is also included in the dplyr tidyverse package. The %>% symbol can seem confusing and intimidating at first. However, once you understand the basic idea, it can become addicting!

The %>% symbol is placed between a value on the left and a function on the right. The %>% simply takes the value to the left and passes it to the function on the right as the first argument. It acts as a “pipe”. That’s it!

Suppose we have a variable, x.

x <- 9

The following are the exact same.

sqrt(x)

## [1] 3

x %>% sqrt()

## [1] 3

As a slightly more complex example, the following calls to ggplot are also equivalent.

gp1 <- pharmacoData %>%
    ggplot(aes(x = concentration, y = viability))

gp2 <- ggplot(pharmacoData,
              aes(x = concentration, y = viability))

That’s it! We’ll continue to use %>% throughout this tutorial to show how useful it can be for chaining various data manipulation steps during an analysis.

The `dplyr` Package

Most of this tutorial will be spent describing several functions in the dplyr package for working with tabular data. Remember, the best way to get a feel for these functions is to use them to answer questions about your data. We have included several examples for using these dplyr functions with the pharmacoData dataset.

There are many more functions in the dplyr package that we won’t have time to cover here. More details on all of the useful functions defined in the package can be found on the dplyr reference page.

Subsetting

First, let’s take a look at subsetting the data. To subset rows in a table based on values in a column, use the filter function.

The following examples filter the data on a single drug and a singe cell line, respectively.

nilotinibData <- filter(pharmacoData, drug == "Nilotinib")
head(nilotinibData)

##   cellLine      drug doseID concentration viability study
## 1    22RV1 Nilotinib doses1        0.0025    109.98  CCLE
## 2    22RV1 Nilotinib doses2        0.0080    107.66  CCLE
## 3    22RV1 Nilotinib doses3        0.0250     97.80  CCLE
## 4    22RV1 Nilotinib doses4        0.0800    115.10  CCLE
## 5    22RV1 Nilotinib doses5        0.2500    129.50  CCLE
## 6    22RV1 Nilotinib doses6        0.8000    121.20  CCLE

cl639vData <- filter(pharmacoData, cellLine == "639-V")
head(cl639vData)

##   cellLine   drug doseID concentration viability study
## 1    639-V 17-AAG doses1        0.0025      90.0  CCLE
## 2    639-V 17-AAG doses2        0.0080      86.0  CCLE
## 3    639-V 17-AAG doses3        0.0250      98.8  CCLE
## 4    639-V 17-AAG doses4        0.0800      77.0  CCLE
## 5    639-V 17-AAG doses5        0.2500      26.0  CCLE
## 6    639-V 17-AAG doses6        0.8000      13.0  CCLE

We can also combine multiple filters.

n6Data <- filter(pharmacoData,
                 drug == "Nilotinib",
                 cellLine == "639-V")
head(n6Data)

##   cellLine      drug doseID concentration viability study
## 1    639-V Nilotinib doses1        0.0025    106.81  CCLE
## 2    639-V Nilotinib doses2        0.0080     85.00  CCLE
## 3    639-V Nilotinib doses3        0.0250     94.90  CCLE
## 4    639-V Nilotinib doses4        0.0800     95.50  CCLE
## 5    639-V Nilotinib doses5        0.2500    102.62  CCLE
## 6    639-V Nilotinib doses6        0.8000    103.57  CCLE

The distinct function is a quick way to just take the unique rows in a table. The function can be called with zero or more columns specified. If any columns are specified, only unique rows for those columns will be returned.

The following returns the unique cell line and drug combinations in our data.

cldData <- distinct(pharmacoData, cellLine, drug)
head(cldData)

##   cellLine       drug
## 1    22RV1     17-AAG
## 2    22RV1    AZD6244
## 3    22RV1  Nilotinib
## 4    22RV1   Nutlin-3
## 5    22RV1 PD-0325901
## 6    22RV1 PD-0332991

dim(cldData)

## [1] 2557    2

To subset columns, use the select function. The following example returns a smaller table with just the cellLine and drug columns.

subdat <- select(pharmacoData, cellLine, drug)
head(subdat)

##   cellLine   drug
## 1    22RV1 17-AAG
## 2    22RV1 17-AAG
## 3    22RV1 17-AAG
## 4    22RV1 17-AAG
## 5    22RV1 17-AAG
## 6    22RV1 17-AAG

Modifying

Now that we know how to subset columns, what about adding columns? This can be done with the mutate function. Suppose instead of concentrations, we want to look at the data with log2 concentrations. We can add a new column to the table with the following call.

pharmacoData %>%
    dplyr::mutate(logConcentration = log2(concentration)) %>%
    head()

##   cellLine   drug doseID concentration viability study logConcentration
## 1    22RV1 17-AAG doses1        0.0025    94.100  CCLE       -8.6438562
## 2    22RV1 17-AAG doses2        0.0080    86.000  CCLE       -6.9657843
## 3    22RV1 17-AAG doses3        0.0250    99.932  CCLE       -5.3219281
## 4    22RV1 17-AAG doses4        0.0800    85.000  CCLE       -3.6438562
## 5    22RV1 17-AAG doses5        0.2500    62.000  CCLE       -2.0000000
## 6    22RV1 17-AAG doses6        0.8000    29.000  CCLE       -0.3219281

Simple enough! Notice that the new column is added as “logConcentration”, as specified in the call to mutate. What would have happened if we had set the new column to “concetration” (the name of an existing column)? Give it a try!

Remember, if you want to keep the new columns, you’ll have to assign the modified table to a variable.

pharmacoData <- pharmacoData %>%
    dplyr::mutate(logConcentration = log2(concentration))
pharmacoData

##    cellLine    drug doseID concentration viability study logConcentration
## 1     22RV1  17-AAG doses1        0.0025    94.100  CCLE       -8.6438562
## 2     22RV1  17-AAG doses2        0.0080    86.000  CCLE       -6.9657843
## 3     22RV1  17-AAG doses3        0.0250    99.932  CCLE       -5.3219281
## 4     22RV1  17-AAG doses4        0.0800    85.000  CCLE       -3.6438562
## 5     22RV1  17-AAG doses5        0.2500    62.000  CCLE       -2.0000000
## 6     22RV1  17-AAG doses6        0.8000    29.000  CCLE       -0.3219281
## 7     22RV1  17-AAG doses7        2.5300    26.000  CCLE        1.3391374
## 8     22RV1  17-AAG doses8        8.0000    20.000  CCLE        3.0000000
## 9     22RV1 AZD6244 doses1        0.0025    95.900  CCLE       -8.6438562
## 10    22RV1 AZD6244 doses2        0.0080    77.000  CCLE       -6.9657843
## 11    22RV1 AZD6244 doses3        0.0250    58.000  CCLE       -5.3219281
## 12    22RV1 AZD6244 doses4        0.0800    62.000  CCLE       -3.6438562
## 13    22RV1 AZD6244 doses5        0.2500    62.000  CCLE       -2.0000000
## 14    22RV1 AZD6244 doses6        0.8000    64.000  CCLE       -0.3219281
##  [ reached 'max' / getOption("max.print") -- omitted 43413 rows ]

Summarizing

Another useful set of functions in the dplyr package allow for aggregating across the rows of a table. Suppose we want to compute some summary measures of the viability scores.

pharmacoData %>%
    summarize(minViability = min(viability),
              maxViability = max(viability),
              avgViability = mean(viability))

##   minViability maxViability avgViability
## 1          -20     319.4919     88.12281

Great!

For the simple case of counting the occurrences of the unique values in a column, use count. The following example counts the number of rows in the table corresponding to each study.

count(pharmacoData, study)

## # A tibble: 2 x 2
##   study     n
##   <chr> <int>
## 1 CCLE  20414
## 2 GDSC  23013

Interesting! It looks like we have slightly more data from the GDSC study.

Grouping

Summarization of the entire table is great, but often we want to summarize by groups. For example, instead of just computing the minimum, maximum and average viability across all viability measures, what about computing these values for the CCLE and GDSC studies separately?

To do this, dplyr includes the group_by function. All we have to do is “group” by study before calling summarize as we did above.

pharmacoData %>%
    group_by(study) %>%
    summarize(minViability = min(viability),
              maxViability = max(viability),
              avgViability = mean(viability))

## # A tibble: 2 x 4
##   study minViability maxViability avgViability
##   <chr>        <dbl>        <dbl>        <dbl>
## 1 CCLE         -20           201          85.9
## 2 GDSC         -14.2         319.         90.1

Amazing, right?

Tip: Always remember to upgroup your data after you’re finished performing operations on the groups. Forgetting that your data is still “grouped” can cause major headaches while performing data analysis! If you’re not sure if the data is grouped, just ungroup! (There’s no harm in calling ungroup too often.)

pharmacoData %>%
    group_by(cellLine, drug, study) %>%
    mutate(viability = viability / max(viability) * 100) %>%
    ungroup()

## # A tibble: 43,427 x 7
##    cellLine drug    doseID concentration viability study logConcentration
##    <chr>    <chr>   <chr>          <dbl>     <dbl> <chr>            <dbl>
##  1 22RV1    17-AAG  doses1        0.0025      94.2 CCLE            -8.64 
##  2 22RV1    17-AAG  doses2        0.008       86.1 CCLE            -6.97 
##  3 22RV1    17-AAG  doses3        0.025      100   CCLE            -5.32 
##  4 22RV1    17-AAG  doses4        0.08        85.1 CCLE            -3.64 
##  5 22RV1    17-AAG  doses5        0.25        62.0 CCLE            -2    
##  6 22RV1    17-AAG  doses6        0.8         29.0 CCLE            -0.322
##  7 22RV1    17-AAG  doses7        2.53        26.0 CCLE             1.34 
##  8 22RV1    17-AAG  doses8        8           20.0 CCLE             3    
##  9 22RV1    AZD6244 doses1        0.0025     100   CCLE            -8.64 
## 10 22RV1    AZD6244 doses2        0.008       80.3 CCLE            -6.97 
## # … with 43,417 more rows

Can you figure out what we’re doing in the code above?

Joining

Finally, the dplyr package includes several functions for combining multiple tables. These functions are incredibly useful for combining multiple tables with partially overlapping data. For example, what if we want to combine the raw and summarized pharmacological datasets?

Notice that both datasets include columns with cellLine and drug information.

head(pharmacoData)

##   cellLine   drug doseID concentration viability study logConcentration
## 1    22RV1 17-AAG doses1        0.0025    94.100  CCLE       -8.6438562
## 2    22RV1 17-AAG doses2        0.0080    86.000  CCLE       -6.9657843
## 3    22RV1 17-AAG doses3        0.0250    99.932  CCLE       -5.3219281
## 4    22RV1 17-AAG doses4        0.0800    85.000  CCLE       -3.6438562
## 5    22RV1 17-AAG doses5        0.2500    62.000  CCLE       -2.0000000
## 6    22RV1 17-AAG doses6        0.8000    29.000  CCLE       -0.3219281

head(summarizedData)

##   cellLine      drug ic50_CCLE  auc_CCLE  ic50_GDSC auc_GDSC
## 1    22RV1 Nilotinib  8.000000 0.0000000 155.269917 0.003935
## 2     5637 Nilotinib  7.475355 0.0072625 219.934550 0.003616
## 3    639-V Nilotinib  8.000000 0.0710125  92.177125 0.007622
## 4      697 Nilotinib  1.910434 0.1573375   3.063552 0.069265
## 5    769-P Nilotinib  8.000000 0.0000000  19.633514 0.028758
## 6    786-0 Nilotinib  8.000000 0.0750125 137.066882 0.005482

We will use the full_join function to combine these two tables and specify that this should be done by matching the cellLine and drug columns.

fullData <- full_join(pharmacoData, summarizedData, by = c("cellLine", "drug"))
head(fullData)

##   cellLine   drug doseID concentration viability study logConcentration
## 1    22RV1 17-AAG doses1        0.0025    94.100  CCLE       -8.6438562
## 2    22RV1 17-AAG doses2        0.0080    86.000  CCLE       -6.9657843
## 3    22RV1 17-AAG doses3        0.0250    99.932  CCLE       -5.3219281
## 4    22RV1 17-AAG doses4        0.0800    85.000  CCLE       -3.6438562
## 5    22RV1 17-AAG doses5        0.2500    62.000  CCLE       -2.0000000
## 6    22RV1 17-AAG doses6        0.8000    29.000  CCLE       -0.3219281
##   ic50_CCLE auc_CCLE ic50_GDSC auc_GDSC
## 1 0.3297017  0.37246  3.491684 0.067091
## 2 0.3297017  0.37246  3.491684 0.067091
## 3 0.3297017  0.37246  3.491684 0.067091
## 4 0.3297017  0.37246  3.491684 0.067091
## 5 0.3297017  0.37246  3.491684 0.067091
## 6 0.3297017  0.37246  3.491684 0.067091

Notice that we now have a single table with the columns from both tables. There are several other functions for merging tables, including left_join, inner_join, and anti_join. To learn more about how these differ, take a look at the documentation page.

The `tidyr` Package

While the dplyr package includes many functions for manipulating data, we would have a hard time using these tools if our data is not in the correct “form”. For example, what if we want to compare the viability scores in pharmacoData for two drugs in the CCLE study? To do this, we would want the two drugs to be in separate columns, so that we can compare them side-by-side. No amount of subsetting or mutating the table will get us there. We need to fundamentally transform the shape of our data.

This is where the tidyr package comes in. The tidyr package includes several functions to help with arranging and rearranging our data. We will highlight the two most important functions for this task:

spread: to spread values in a single column to multiple columns,
gather: to gather values in multiple columns to a single column.

Again, there are many more functions in the tidyr package that we won’t have time to cover here. More details can be found on the tidyr reference page.

Don’t worry if it takes some time for these ideas to start making sense! At first, transforming data with tidyr can feel like mental yoga.

Spreading

To demonstrate spreading data, let’s consider the example described above. Suppose we would like to compare the viability scores for two drugs in the CCLE study, lapatinib and paclitaxel, across cell lines and concentrations.

First, using the dplyr functions from above, we’ll subset the data.

subdat <- pharmacoData %>%
    filter(study == "CCLE",
           drug %in% c("lapatinib", "paclitaxel")) %>%
    select(cellLine, drug, concentration, viability)
head(subdat)

##   cellLine      drug concentration viability
## 1      697 lapatinib        0.0025     89.00
## 2      697 lapatinib        0.0080    104.52
## 3      697 lapatinib        0.0250    117.90
## 4      697 lapatinib        0.0800    140.10
## 5      697 lapatinib        0.2500     96.00
## 6      697 lapatinib        0.8000     83.00

Next, we will use the spread function to take the viability scores for the two drugs into separate columns. How do we do this? We would like to take the values in the drug column and turn these into our new columns. We then want to populate these columns with values from the viability column. To do this, we simply specify drug as the “key” and viability as the “value” to the `spread function.

subdat_wide <- spread(subdat, key = drug, value = viability)
head(subdat_wide)

##   cellLine concentration lapatinib paclitaxel
## 1      697        0.0025     89.00         12
## 2      697        0.0080    104.52          3
## 3      697        0.0250    117.90          3
## 4      697        0.0800    140.10          2
## 5      697        0.2500     96.00          1
## 6      697        0.8000     83.00          1

Great! Notice that the data in the other columns (cellLine and concentration) are still there. When populating the lapatinib and paclitaxel columns, the spread function will make sure to keep track of the remaining columns.

Gathering

Next, let’s use the summarizedData to demonstrate how gather works.

head(summarizedData)

##   cellLine      drug ic50_CCLE  auc_CCLE  ic50_GDSC auc_GDSC
## 1    22RV1 Nilotinib  8.000000 0.0000000 155.269917 0.003935
## 2     5637 Nilotinib  7.475355 0.0072625 219.934550 0.003616
## 3    639-V Nilotinib  8.000000 0.0710125  92.177125 0.007622
## 4      697 Nilotinib  1.910434 0.1573375   3.063552 0.069265
## 5    769-P Nilotinib  8.000000 0.0000000  19.633514 0.028758
## 6    786-0 Nilotinib  8.000000 0.0750125 137.066882 0.005482

Suppose we now want to organize all of the IC50 and AUC values stored in the separate ic50_CCLE, auc_CCLE, ic50_GDSC and auc_GDSC columns into a single column of “metric values”. Essentially, we would like to reverse the “spreading” procedure that we carried out above. To do this, we simply call gather with what we want our “key” and “value” columns to be named.

summarizedDataLong <- gather(summarizedData, metric, value)
head(summarizedDataLong)

##     metric value
## 1 cellLine 22RV1
## 2 cellLine  5637
## 3 cellLine 639-V
## 4 cellLine   697
## 5 cellLine 769-P
## 6 cellLine 786-0

Oh no! Something went wrong here. Remember, we only want to “gather” the four columns with the IC50 and AUC values. Here, we’ve gathered all of the columns (including the cellLine and drug columns).

To specify that we only want to gather the four columns, we pass these column names to gather.

summarizedDataLong <- gather(summarizedData, key = metric, value = value,
                             ic50_CCLE, auc_CCLE, ic50_GDSC, auc_GDSC)
head(summarizedDataLong)

##   cellLine      drug    metric    value
## 1    22RV1 Nilotinib ic50_CCLE 8.000000
## 2     5637 Nilotinib ic50_CCLE 7.475355
## 3    639-V Nilotinib ic50_CCLE 8.000000
## 4      697 Nilotinib ic50_CCLE 1.910434
## 5    769-P Nilotinib ic50_CCLE 8.000000
## 6    786-0 Nilotinib ic50_CCLE 8.000000

Alternatively, since the number of columns we would like to exclude is smaller, we can specify these columns with the a “minus”.

## alternatively
summarizedDataLong <- gather(summarizedData, key = metric, value = value,
                             -cellLine, -drug)
head(summarizedDataLong)

##   cellLine      drug    metric    value
## 1    22RV1 Nilotinib ic50_CCLE 8.000000
## 2     5637 Nilotinib ic50_CCLE 7.475355
## 3    639-V Nilotinib ic50_CCLE 8.000000
## 4      697 Nilotinib ic50_CCLE 1.910434
## 5    769-P Nilotinib ic50_CCLE 8.000000
## 6    786-0 Nilotinib ic50_CCLE 8.000000

The result is the same!

Now, we have a new “key” column (metric) with the names of the columns that were gathered and a new “value” column (value) with the original entries of those columns.

The gather and spread functions are opposites. Therefore, we can undo the gather operation above by calling spread.

summarizedDataUndo <- spread(summarizedDataLong, key = metric, value = value)
head(summarizedDataUndo)

##   cellLine       drug auc_CCLE auc_GDSC ic50_CCLE  ic50_GDSC
## 1    22RV1     17-AAG 0.372460 0.067091 0.3297017   3.491684
## 2    22RV1    AZD6244 0.355125 0.055585 8.0000000  63.866949
## 3    22RV1  Nilotinib 0.000000 0.003935 8.0000000 155.269917
## 4    22RV1   Nutlin-3 0.076925 0.106564 8.0000000  12.805900
## 5    22RV1 PD-0325901 0.385000 0.029649 8.0000000   4.707269
## 6    22RV1 PD-0332991 0.041625 0.225682 8.0000000   1.678425

We are back to our original dataset!

References

For a complete book on how to do data science using R and the tidyverse, we highly recommend R for Data Science, available for free online, by Garrett Grolemund and Hadley Wickham.

More practically, the accompanying websites for the tidyverse packages are absolutely amazing. These sites are a great resource for trying to understand how these packages work. Also, when in doubt ask the internet.