Abstract

“Tidying PBM data. upbm package version: 0.99.0”

Introduction

The PBMExperiment class is the core structure defined in the upbm package for storing raw and normalized universal PBM data (see vignette("upbm-classes")). While the structure is useful for analysis and organization, tabular data is often much easier to work with when computing quick summary statistics and performing exploratory analysis.

Since performing exploratory analysis with data stored in PBMExperiment objects is a fairly common task, we have defined a method for converting PBMExperiment assay data to tabular format. This is implemented as an extension to the broom::tidy function originally defined in the broom package.

In this vignette, we demonstrate the various uses of the broom::tidy function with PBMExperiment objects using the example HOXC9 dataset from the upbmData package.

HOXC9 Dataset

For details on the example HOXC9 dataset, see the quick start vignette in this package or the upbmData package vignette. Here, we will just use Alexa488 scans.

data(hoxc9alexa, package = "upbmData")
hoxc9alexa
## class: PBMExperiment 
## dim: 62976 30 
## metadata(0):
## assays(2): fore back
## rownames: NULL
## rowData names(4): Column Row probeID Sequence
## colnames(30): s1 s2 ... s35 s36
## colData names(10): date version ... condition id_idx
## probeCols(4): Column Row probeID Sequence
## probeFilter names(1): probeID
## probeTrim: 1 36

Tidy Data

“Tidy data” has become a popular and powerful framework for organizing data during interactive analysis. In the tidy data framework, data is organized as a data.frame with each row corresponding to an individual observation or sample. Not only does the tidy data framework help keep data organized, but it also unlocks the powerful data parsing and visualization functions in the Tidyverse collection of packages.

To keep track of various probe and sample metadata compactly, uPBM data are not organized as tidy data. Instead, they are stored as PBMExperiment and PBMDesign objects which extend core Bioconductor data structures (see vignette("upbm-classes")). However, when performing interactive analysis, it can be useful to extract tidy data from the PBMExperiment objects.

The data for a single assay in PBMExperiment and SummarizedExperiment objects can be returned by passing the objects to broom::tidy.

broom::tidy(hoxc9alexa)
## # A tibble: 41,944 x 34
##       s1    s2    s3    s4    s5    s6    s7    s8    s9   s10   s11   s12
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  4307  3327  2127  9522  7123  4610  1137 21109 16053 10276  2387  4166
##  2  8101  6414  3884 18050 14217  8463  1209 40753 31384 19347  2577  4480
##  3  8754  7106  4955 19591 15547 10633  1140 43708 34539 23858  2408  4338
##  4  1842  1015   789  4150  2183  1632   720  9234  4734  3646  1500  2538
##  5  1980   927   740  4499  2039  1554   708 10014  4486  3479  1461  2569
##  6  7442  5524  3539 16744 12009  7820  1298 37478 27243 17370  2816  5265
##  7  3159  2498  1505  7032  5449  3201   812 15805 12036  7093  1689  2941
##  8  1586  1217   777  3493  2600  1673   558  7524  5793  3629  1175  1972
##  9  1465  1329   921  3231  2814  1940   725  7184  6128  4373  1748  3030
## 10  1930  1365   936  4323  2917  1992   663  9419  6528  4510  1437  2604
## # … with 41,934 more rows, and 22 more variables: s17 <dbl>, s18 <dbl>,
## #   s19 <dbl>, s20 <dbl>, s21 <dbl>, s22 <dbl>, s23 <dbl>, s24 <dbl>,
## #   s25 <dbl>, s26 <dbl>, s27 <dbl>, s28 <dbl>, s31 <dbl>, s32 <dbl>,
## #   s33 <dbl>, s34 <dbl>, s35 <dbl>, s36 <dbl>, Column <int>, Row <int>,
## #   probeID <chr>, Sequence <chr>

By default, the first assay in the object (here, fore) is returned as a wide tibble with columns corresponding to individual samples. Notice that the rowData are also included as columns in the tibble. Additionally, note that the tibble has far fewer rows (41,944) than the original PBMExperiment object (62,976).

The default behavior of broom::tidy is to perform any probe filtering and sequence trimming defined in the PBMDesign associated with the PBMExperiment object. In this case, probe sequences were trimmed to 36 nucleotides and all background and control probes were excluded. This filtering and trimming can be turned off by specifying process = FALSE.

broom::tidy(hoxc9alexa, process = FALSE)
## # A tibble: 62,976 x 34
##       s1    s2    s3    s4    s5    s6    s7    s8    s9   s10   s11   s12
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  4446  2191  2209  9780  4681  4710  2500 22446 10458 10511  5604  9888
##  2  4057  1978  2041  8967  4297  4373  2063 20142  9768  9920  4710  8363
##  3  5326  3443  4059 11924  7476  8833  6260 26805 16814 19558 14274 25597
##  4  5686  3410  4012 12573  7458  8886  6564 27970 16547 19477 14565 25868
##  5  4401  3444  2353  9943  7499  5112  1111 21987 16889 11481  2366  4255
##  6  4307  3327  2127  9522  7123  4610  1137 21109 16053 10276  2387  4166
##  7  8101  6414  3884 18050 14217  8463  1209 40753 31384 19347  2577  4480
##  8  8754  7106  4955 19591 15547 10633  1140 43708 34539 23858  2408  4338
##  9  1842  1015   789  4150  2183  1632   720  9234  4734  3646  1500  2538
## 10  1980   927   740  4499  2039  1554   708 10014  4486  3479  1461  2569
## # … with 62,966 more rows, and 22 more variables: s17 <dbl>, s18 <dbl>,
## #   s19 <dbl>, s20 <dbl>, s21 <dbl>, s22 <dbl>, s23 <dbl>, s24 <dbl>,
## #   s25 <dbl>, s26 <dbl>, s27 <dbl>, s28 <dbl>, s31 <dbl>, s32 <dbl>,
## #   s33 <dbl>, s34 <dbl>, s35 <dbl>, s36 <dbl>, Column <int>, Row <int>,
## #   probeID <chr>, Sequence <chr>
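
To make the effect of filtering and trimming concrete, here is a minimal base R sketch on a small made-up probe table. It is not the code that upbm runs internally. The probe IDs, sequences, intensities, and the rule "keep IDs starting with dBr_" are all assumptions chosen for illustration; only the trim positions (1 through 36) come from the probeTrim value printed for hoxc9alexa.

```r
# Toy probe table; all IDs, sequences, and intensities are made up.
probes <- data.frame(
  probeID  = c("ctrl_A", "dBr_0001", "dBr_0002", "ctrl_B"),
  Sequence = c("#N/A",
               paste(rep("ACGT", 15), collapse = ""),   # 60 nucleotides
               paste(rep("TTGCA", 12), collapse = ""),  # 60 nucleotides
               "#N/A"),
  s1       = c(4446, 4307, 8101, 5326),
  stringsAsFactors = FALSE
)

# Hypothetical filter rule (an assumption): keep only probes whose ID
# starts with "dBr_", dropping the control rows.
keep <- startsWith(probes$probeID, "dBr_")
processed <- probes[keep, ]

# Trim each remaining sequence to positions 1 through 36.
processed$Sequence <- substr(processed$Sequence, 1, 36)

nrow(probes)               # 4 rows before filtering
nrow(processed)            # 2 rows after filtering
nchar(processed$Sequence)  # 36 36
```

Filtering removes rows and trimming shortens the Sequence column, which is why the processed output has fewer rows than the process = FALSE output while keeping the same columns.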

The assay can also be specified.

broom::tidy(hoxc9alexa, assay = "back")
## # A tibble: 41,944 x 34
##       s1    s2    s3    s4    s5    s6    s7    s8    s9   s10   s11   s12
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1   225   106   107   471   206   208   224  1007   401   436   434   766
##  2   238   129   193   503   243   395   222  1058   514   864   447   782
##  3   235   147   263   505   286   544   216  1069   583  1145   434   784
##  4   206   123   114   430   229   230   213   910   482   465   434   780
##  5   199   101    97   399   177   179   216   850   358   341   444   750
##  6   254   114   160   520   217   281   213  1093   404   592   430   751
##  7   416   299   220   867   601   426   304  1878  1409   963   614  1106
##  8   191   184   112   395   383   214   229   819   780   440   488   843
##  9   155   120   122   314   232   240   224   658   485   489   458   824
## 10   155   119   134   317   238   259   220   658   495   515   445   817
## # … with 41,934 more rows, and 22 more variables: s17 <dbl>, s18 <dbl>,
## #   s19 <dbl>, s20 <dbl>, s21 <dbl>, s22 <dbl>, s23 <dbl>, s24 <dbl>,
## #   s25 <dbl>, s26 <dbl>, s27 <dbl>, s28 <dbl>, s31 <dbl>, s32 <dbl>,
## #   s33 <dbl>, s34 <dbl>, s35 <dbl>, s36 <dbl>, Column <int>, Row <int>,
## #   probeID <chr>, Sequence <chr>

While returning a wide tibble maintains the original shape of the assay data, with tidy data, we often prefer each row to correspond to a single observation in “long” format. We can return a long tibble by specifying long = TRUE.

broom::tidy(hoxc9alexa, assay = "back", long = TRUE)
## # A tibble: 1,258,320 x 16
##    Column   Row probeID Sequence cname  back   date version    id reuse
##     <int> <int> <chr>   <chr>    <chr> <dbl>  <dbl> <chr>   <dbl> <dbl>
##  1      6     1 dBr_14… GGTGTGA… s1      225 170606 v14       226     1
##  2      7     1 dBr_06… CAGTCTA… s1      238 170606 v14       226     1
##  3      8     1 dBr_39… CTTTTTA… s1      235 170606 v14       226     1
##  4      9     1 dBr_06… CAGCTAC… s1      206 170606 v14       226     1
##  5     10     1 dBr_05… GCTTCGA… s1      199 170606 v14       226     1
##  6     15     1 dBr_16… CGCCCGT… s1      254 170606 v14       226     1
##  7     23     1 dBr_20… TTAGCCC… s1      416 170606 v14       226     1
##  8     24     1 dBr_25… TGCACAA… s1      191 170606 v14       226     1
##  9     26     1 dBr_40… GGATGCC… s1      155 170606 v14       226     1
## 10     27     1 dBr_21… GTCAGAA… s1      155 170606 v14       226     1
## # … with 1,258,310 more rows, and 6 more variables: type <chr>, pmt <dbl>,
## #   idx <dbl>, target <chr>, condition <chr>, id_idx <chr>

When long = TRUE, the column names are placed in a cname column of the tibble and the assay values are included in a column matching the assay name (here, back). In long format, the output contains the colData in addition to the column names, assay values, and rowData. The tibble has one row per probe and sample: 41,944 probes × 30 samples = 1,258,320 rows.
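
The wide-to-long reshaping can be sketched with tidyr and dplyr on toy data. This is only an illustration of the idea, assuming a made-up wide table and a made-up sample table standing in for colData; it does not reproduce how upbm builds its output, and the pmt values are invented.

```r
library(tidyr)
library(dplyr)

# Toy wide table: one row per probe, one column per sample (made-up values).
wide <- tibble::tibble(
  probeID = c("p1", "p2", "p3"),
  s1      = c(225, 238, 235),
  s2      = c(106, 129, 147)
)

# Toy sample metadata standing in for colData (values are made up).
samples <- tibble::tibble(
  cname = c("s1", "s2"),
  pmt   = c(450, 400)
)

long <- wide %>%
  # Move the sample column names into `cname` and the values into `back`.
  pivot_longer(cols = c(s1, s2), names_to = "cname", values_to = "back") %>%
  # Attach the per-sample metadata to every row of the matching sample.
  left_join(samples, by = "cname")

dim(long)    # 6 rows (3 probes x 2 samples), 4 columns
names(long)  # "probeID" "cname" "back" "pmt"
```

One difference from the output shown above: pivot_longer orders rows probe by probe, whereas the broom::tidy output lists all rows for s1 first. The content is the same; only the row order differs.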

Tidying of multiple assays is also supported.

broom::tidy(hoxc9alexa, assay = c("fore", "back"))
## # A tibble: 1,258,320 x 17
##    Column   Row probeID Sequence cname  fore  back   date version    id
##     <int> <int> <chr>   <chr>    <chr> <dbl> <dbl>  <dbl> <chr>   <dbl>
##  1      6     1 dBr_14… GGTGTGA… s1     4307   225 170606 v14       226
##  2      7     1 dBr_06… CAGTCTA… s1     8101   238 170606 v14       226
##  3      8     1 dBr_39… CTTTTTA… s1     8754   235 170606 v14       226
##  4      9     1 dBr_06… CAGCTAC… s1     1842   206 170606 v14       226
##  5     10     1 dBr_05… GCTTCGA… s1     1980   199 170606 v14       226
##  6     15     1 dBr_16… CGCCCGT… s1     7442   254 170606 v14       226
##  7     23     1 dBr_20… TTAGCCC… s1     3159   416 170606 v14       226
##  8     24     1 dBr_25… TGCACAA… s1     1586   191 170606 v14       226
##  9     26     1 dBr_40… GGATGCC… s1     1465   155 170606 v14       226
## 10     27     1 dBr_21… GTCAGAA… s1     1930   155 170606 v14       226
## # … with 1,258,310 more rows, and 7 more variables: reuse <dbl>,
## #   type <chr>, pmt <dbl>, idx <dbl>, target <chr>, condition <chr>,
## #   id_idx <chr>

When multiple assays are specified, the data are returned as a long tibble, as in the call above, which does not set long = TRUE.
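
One way to see why the long layout suits multiple assays: in a wide table each sample already occupies one column, so two assays cannot share it without renaming columns, while in long format each assay simply becomes its own value column. The sketch below illustrates this with tidyr and dplyr on made-up data. The helper to_long is hypothetical and is not part of upbm or broom.

```r
library(tidyr)
library(dplyr)

# Two toy assays over the same probes and samples (made-up values).
fore <- tibble::tibble(probeID = c("p1", "p2"),
                       s1 = c(4307, 8101), s2 = c(3327, 6414))
back <- tibble::tibble(probeID = c("p1", "p2"),
                       s1 = c(225, 238), s2 = c(106, 129))

# Hypothetical helper: reshape one wide assay into long format, storing
# the values in a column named after the assay.
to_long <- function(wide, value_name) {
  pivot_longer(wide, cols = -probeID,
               names_to = "cname", values_to = value_name)
}

# Match rows on probe and sample so each row carries both assay values.
both <- inner_join(to_long(fore, "fore"), to_long(back, "back"),
                   by = c("probeID", "cname"))

names(both)  # "probeID" "cname" "fore" "back"
nrow(both)   # 4 rows (2 probes x 2 samples)
```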

While we have described how to call broom::tidy with PBMExperiment objects, more generally, the function can also be applied to any SummarizedExperiment object.

se <- as(hoxc9alexa, "SummarizedExperiment")
broom::tidy(se, long = TRUE)
## # A tibble: 1,889,280 x 16
##    Column   Row probeID Sequence cname  fore   date version    id reuse
##     <int> <int> <chr>   <chr>    <chr> <dbl>  <dbl> <chr>   <dbl> <dbl>
##  1      1     1 GE_Bri… #N/A     s1     4446 170606 v14       226     1
##  2      2     1 GE_Bri… #N/A     s1     4057 170606 v14       226     1
##  3      3     1 DarkCo… #N/A     s1     5326 170606 v14       226     1
##  4      4     1 DarkCo… #N/A     s1     5686 170606 v14       226     1
##  5      5     1 Cbf_5b… GAAGCTA… s1     4401 170606 v14       226     1
##  6      6     1 dBr_14… GGTGTGA… s1     4307 170606 v14       226     1
##  7      7     1 dBr_06… CAGTCTA… s1     8101 170606 v14       226     1
##  8      8     1 dBr_39… CTTTTTA… s1     8754 170606 v14       226     1
##  9      9     1 dBr_06… CAGCTAC… s1     1842 170606 v14       226     1
## 10     10     1 dBr_05… GCTTCGA… s1     1980 170606 v14       226     1
## # … with 1,889,270 more rows, and 6 more variables: type <chr>, pmt <dbl>,
## #   idx <dbl>, target <chr>, condition <chr>, id_idx <chr>

Notice that when calling broom::tidy on the SummarizedExperiment object, background probes are not filtered, so all 62,976 probes appear in the output (62,976 × 30 samples = 1,889,280 rows). Similarly, probe sequences are not trimmed. These features are unique to PBMExperiment objects and are lost when converting the data to a SummarizedExperiment object.