vignettes/other-tidydata.Rmd
Abstract
“Tidying PBM data. upbm package version: 0.99.0”
The PBMExperiment class is the core structure defined in the upbm package for storing raw and normalized universal PBM data (see vignette("upbm-classes")). While this structure is useful for analysis and organization, tabular data is often much easier to work with when computing quick summary statistics and performing exploratory analysis.
suppressPackageStartupMessages(library("upbm"))
Since exploratory analysis of data stored in PBMExperiment objects is a fairly common task, we have defined a method for converting PBMExperiment assay data to tabular format. This is implemented as an extension to the broom::tidy function originally defined in the broom package.
In this vignette, we demonstrate the various uses of the broom::tidy function with PBMExperiment objects using the example HOXC9 dataset from the upbmData package.
For details on the example HOXC9 dataset, see the quick start vignette in this package or the upbmData package vignette. Here, we will just use Alexa488 scans.
data(hoxc9alexa, package = "upbmData")
hoxc9alexa
## class: PBMExperiment
## dim: 62976 30
## metadata(0):
## assays(2): fore back
## rownames: NULL
## rowData names(4): Column Row probeID Sequence
## colnames(30): s1 s2 ... s35 s36
## colData names(10): date version ... condition id_idx
## probeCols(4): Column Row probeID Sequence
## probeFilter names(1): probeID
## probeTrim: 1 36
“Tidy data” has become a popular and powerful framework for organizing data during interactive analysis. In the tidy data framework, data are organized as a data.frame with each row corresponding to an individual observation or sample. Not only does the tidy data framework help keep data organized, but it also unlocks the powerful data parsing and visualization functions in the Tidyverse collection of packages.
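As a rough illustration of the "one row per observation" idea, the sketch below reshapes a small wide table into long form with tidyr. The table, its values, and the probe names are toy placeholders rather than real upbm output, and the column names `cname` and `fore` are chosen only for illustration.

```r
library(tibble)
library(tidyr)

# Toy wide table: one row per probe, one column per sample
# (probe names and intensities are placeholders)
wide <- tibble(
  probeID = c("p1", "p2", "p3"),
  s1 = c(4307, 8101, 8754),
  s2 = c(3327, 6414, 7106)
)

# Reshape so that each row is a single (probe, sample) observation
long <- pivot_longer(wide, cols = c(s1, s2),
                     names_to = "cname", values_to = "fore")
long
```

The three probes measured in two samples become six rows, one per measurement, which is the shape most Tidyverse functions expect.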
To keep track of various probe and sample metadata compactly, uPBM data are not organized as tidy data. Instead, they are stored as PBMExperiment and PBMDesign objects which extend core Bioconductor data structures (see vignette("upbm-classes")). However, when performing interactive analysis, it can be useful to extract tidy data from the PBMExperiment objects.
The data for a single assay in PBMExperiment and SummarizedExperiment objects can be returned by passing the objects to broom::tidy.
broom::tidy(hoxc9alexa)
## # A tibble: 41,944 x 34
## s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4307 3327 2127 9522 7123 4610 1137 21109 16053 10276 2387 4166
## 2 8101 6414 3884 18050 14217 8463 1209 40753 31384 19347 2577 4480
## 3 8754 7106 4955 19591 15547 10633 1140 43708 34539 23858 2408 4338
## 4 1842 1015 789 4150 2183 1632 720 9234 4734 3646 1500 2538
## 5 1980 927 740 4499 2039 1554 708 10014 4486 3479 1461 2569
## 6 7442 5524 3539 16744 12009 7820 1298 37478 27243 17370 2816 5265
## 7 3159 2498 1505 7032 5449 3201 812 15805 12036 7093 1689 2941
## 8 1586 1217 777 3493 2600 1673 558 7524 5793 3629 1175 1972
## 9 1465 1329 921 3231 2814 1940 725 7184 6128 4373 1748 3030
## 10 1930 1365 936 4323 2917 1992 663 9419 6528 4510 1437 2604
## # … with 41,934 more rows, and 22 more variables: s17 <dbl>, s18 <dbl>,
## # s19 <dbl>, s20 <dbl>, s21 <dbl>, s22 <dbl>, s23 <dbl>, s24 <dbl>,
## # s25 <dbl>, s26 <dbl>, s27 <dbl>, s28 <dbl>, s31 <dbl>, s32 <dbl>,
## # s33 <dbl>, s34 <dbl>, s35 <dbl>, s36 <dbl>, Column <int>, Row <int>,
## # probeID <chr>, Sequence <chr>
By default, the first assay in the object is returned as a wide tibble with columns corresponding to individual samples. Notice that the rowData are also included as columns in the tibble. Additionally, note that the tibble has far fewer rows (41,944) than the original PBMExperiment object (62,976).
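Once the data are in a wide tibble, quick per-sample summaries are straightforward with dplyr. The following is a minimal sketch on a toy stand-in for the wide output: the intensities are the first three rows of s1 to s3 printed above, while the probe IDs are placeholders and the object name `wide` is hypothetical.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the wide output of broom::tidy(hoxc9alexa):
# sample columns (s1-s3) plus probe annotation columns.
# Probe IDs are placeholders, not real identifiers.
wide <- tibble(
  s1 = c(4307, 8101, 8754),
  s2 = c(3327, 6414, 7106),
  s3 = c(2127, 3884, 4955),
  Column = c(6L, 7L, 8L),
  Row = c(1L, 1L, 1L),
  probeID = c("probe_a", "probe_b", "probe_c")
)

# Median intensity per sample; matches() keeps only the s<number> columns,
# so annotation columns such as Column, Row and probeID are left out
sample_medians <- summarise(wide, across(matches("^s[0-9]+$"), median))
sample_medians
```

Selecting by pattern rather than with `starts_with("s")` avoids accidentally picking up the Sequence column, since tidyselect helpers ignore case by default.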
The default behavior of broom::tidy is to perform any probe filtering and sequence trimming defined in the PBMDesign associated with the PBMExperiment object. In this case, probe sequences were trimmed to 36 nucleotides and all background and control probes were excluded. This filtering and trimming can be turned off by specifying process = FALSE.
broom::tidy(hoxc9alexa, process = FALSE)
## # A tibble: 62,976 x 34
## s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4446 2191 2209 9780 4681 4710 2500 22446 10458 10511 5604 9888
## 2 4057 1978 2041 8967 4297 4373 2063 20142 9768 9920 4710 8363
## 3 5326 3443 4059 11924 7476 8833 6260 26805 16814 19558 14274 25597
## 4 5686 3410 4012 12573 7458 8886 6564 27970 16547 19477 14565 25868
## 5 4401 3444 2353 9943 7499 5112 1111 21987 16889 11481 2366 4255
## 6 4307 3327 2127 9522 7123 4610 1137 21109 16053 10276 2387 4166
## 7 8101 6414 3884 18050 14217 8463 1209 40753 31384 19347 2577 4480
## 8 8754 7106 4955 19591 15547 10633 1140 43708 34539 23858 2408 4338
## 9 1842 1015 789 4150 2183 1632 720 9234 4734 3646 1500 2538
## 10 1980 927 740 4499 2039 1554 708 10014 4486 3479 1461 2569
## # … with 62,966 more rows, and 22 more variables: s17 <dbl>, s18 <dbl>,
## # s19 <dbl>, s20 <dbl>, s21 <dbl>, s22 <dbl>, s23 <dbl>, s24 <dbl>,
## # s25 <dbl>, s26 <dbl>, s27 <dbl>, s28 <dbl>, s31 <dbl>, s32 <dbl>,
## # s33 <dbl>, s34 <dbl>, s35 <dbl>, s36 <dbl>, Column <int>, Row <int>,
## # probeID <chr>, Sequence <chr>
The assay can also be specified.
broom::tidy(hoxc9alexa, assay = "back")
## # A tibble: 41,944 x 34
## s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 225 106 107 471 206 208 224 1007 401 436 434 766
## 2 238 129 193 503 243 395 222 1058 514 864 447 782
## 3 235 147 263 505 286 544 216 1069 583 1145 434 784
## 4 206 123 114 430 229 230 213 910 482 465 434 780
## 5 199 101 97 399 177 179 216 850 358 341 444 750
## 6 254 114 160 520 217 281 213 1093 404 592 430 751
## 7 416 299 220 867 601 426 304 1878 1409 963 614 1106
## 8 191 184 112 395 383 214 229 819 780 440 488 843
## 9 155 120 122 314 232 240 224 658 485 489 458 824
## 10 155 119 134 317 238 259 220 658 495 515 445 817
## # … with 41,934 more rows, and 22 more variables: s17 <dbl>, s18 <dbl>,
## # s19 <dbl>, s20 <dbl>, s21 <dbl>, s22 <dbl>, s23 <dbl>, s24 <dbl>,
## # s25 <dbl>, s26 <dbl>, s27 <dbl>, s28 <dbl>, s31 <dbl>, s32 <dbl>,
## # s33 <dbl>, s34 <dbl>, s35 <dbl>, s36 <dbl>, Column <int>, Row <int>,
## # probeID <chr>, Sequence <chr>
While returning a wide tibble maintains the original shape of the assay data, with tidy data we often prefer a “long” format in which each row corresponds to a single observation. We can return a long tibble by specifying long = TRUE.
broom::tidy(hoxc9alexa, assay = "back", long = TRUE)
## # A tibble: 1,258,320 x 16
## Column Row probeID Sequence cname back date version id reuse
## <int> <int> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 6 1 dBr_14… GGTGTGA… s1 225 170606 v14 226 1
## 2 7 1 dBr_06… CAGTCTA… s1 238 170606 v14 226 1
## 3 8 1 dBr_39… CTTTTTA… s1 235 170606 v14 226 1
## 4 9 1 dBr_06… CAGCTAC… s1 206 170606 v14 226 1
## 5 10 1 dBr_05… GCTTCGA… s1 199 170606 v14 226 1
## 6 15 1 dBr_16… CGCCCGT… s1 254 170606 v14 226 1
## 7 23 1 dBr_20… TTAGCCC… s1 416 170606 v14 226 1
## 8 24 1 dBr_25… TGCACAA… s1 191 170606 v14 226 1
## 9 26 1 dBr_40… GGATGCC… s1 155 170606 v14 226 1
## 10 27 1 dBr_21… GTCAGAA… s1 155 170606 v14 226 1
## # … with 1,258,310 more rows, and 6 more variables: type <chr>, pmt <dbl>,
## # idx <dbl>, target <chr>, condition <chr>, id_idx <chr>
When long = TRUE, the column names are placed in a cname column of the tibble and the assay values are included in a column matching the assay name (here, back). In long format, the output also contains the colData, in addition to the column name, assay values, and rowData.
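The long format pairs naturally with grouped summaries. The sketch below is a toy illustration only: it rebuilds a tiny long tibble from the first three background values of s1 and s2 printed earlier, and the names `long_back`, `median_back`, and `n_probes` are hypothetical.

```r
library(tibble)
library(dplyr)

# Toy stand-in for broom::tidy(hoxc9alexa, assay = "back", long = TRUE),
# keeping only the sample name and the assay value
long_back <- tibble(
  cname = rep(c("s1", "s2"), each = 3),
  back = c(225, 238, 235, 106, 129, 147)
)

# One summary row per sample
per_sample <- long_back %>%
  group_by(cname) %>%
  summarise(median_back = median(back), n_probes = n())
per_sample
```

Because the colData travel with each row in the real long output, the same pattern should work with sample annotations such as condition as the grouping variable.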
Tidying of multiple assays is also supported.
broom::tidy(hoxc9alexa, assay = c("fore", "back"), long = TRUE)
## # A tibble: 1,258,320 x 17
## Column Row probeID Sequence cname fore back date version id
## <int> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6 1 dBr_14… GGTGTGA… s1 4307 225 170606 v14 226
## 2 7 1 dBr_06… CAGTCTA… s1 8101 238 170606 v14 226
## 3 8 1 dBr_39… CTTTTTA… s1 8754 235 170606 v14 226
## 4 9 1 dBr_06… CAGCTAC… s1 1842 206 170606 v14 226
## 5 10 1 dBr_05… GCTTCGA… s1 1980 199 170606 v14 226
## 6 15 1 dBr_16… CGCCCGT… s1 7442 254 170606 v14 226
## 7 23 1 dBr_20… TTAGCCC… s1 3159 416 170606 v14 226
## 8 24 1 dBr_25… TGCACAA… s1 1586 191 170606 v14 226
## 9 26 1 dBr_40… GGATGCC… s1 1465 155 170606 v14 226
## 10 27 1 dBr_21… GTCAGAA… s1 1930 155 170606 v14 226
## # … with 1,258,310 more rows, and 7 more variables: reuse <dbl>,
## # type <chr>, pmt <dbl>, idx <dbl>, target <chr>, condition <chr>,
## # id_idx <chr>
When multiple assays are specified, the data will be returned as a long tibble.
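Having both assays side by side makes per-probe comparisons simple. As a hedged sketch, the toy tibble below copies the first three s1 rows of the two-assay output shown above and derives two exploratory columns. The names `fore_minus_back` and `back_fraction` are hypothetical, and the naive difference is only an exploratory quantity, not the background correction used by upbm.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the long, two-assay output (first three s1 rows)
both <- tibble(
  Column = c(6L, 7L, 8L),
  Row = c(1L, 1L, 1L),
  cname = c("s1", "s1", "s1"),
  fore = c(4307, 8101, 8754),
  back = c(225, 238, 235)
)

# Naive per-probe difference and ratio between the two assays
both <- mutate(both,
               fore_minus_back = fore - back,
               back_fraction = back / fore)
both
```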
While we have described how to call broom::tidy with PBMExperiment objects, more generally, the function can also be applied to any SummarizedExperiment object.
se <- as(hoxc9alexa, "SummarizedExperiment")
broom::tidy(se, long = TRUE)
## # A tibble: 1,889,280 x 16
## Column Row probeID Sequence cname fore date version id reuse
## <int> <int> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 1 GE_Bri… #N/A s1 4446 170606 v14 226 1
## 2 2 1 GE_Bri… #N/A s1 4057 170606 v14 226 1
## 3 3 1 DarkCo… #N/A s1 5326 170606 v14 226 1
## 4 4 1 DarkCo… #N/A s1 5686 170606 v14 226 1
## 5 5 1 Cbf_5b… GAAGCTA… s1 4401 170606 v14 226 1
## 6 6 1 dBr_14… GGTGTGA… s1 4307 170606 v14 226 1
## 7 7 1 dBr_06… CAGTCTA… s1 8101 170606 v14 226 1
## 8 8 1 dBr_39… CTTTTTA… s1 8754 170606 v14 226 1
## 9 9 1 dBr_06… CAGCTAC… s1 1842 170606 v14 226 1
## 10 10 1 dBr_05… GCTTCGA… s1 1980 170606 v14 226 1
## # … with 1,889,270 more rows, and 6 more variables: type <chr>, pmt <dbl>,
## # idx <dbl>, target <chr>, condition <chr>, id_idx <chr>
Notice that when calling broom::tidy on the SummarizedExperiment object, background probes are not filtered. Similarly, probe sequences are not trimmed. These features are unique to PBMExperiment objects and are lost when converting the data to a SummarizedExperiment object.
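If only a tidied SummarizedExperiment is available, similar filtering and trimming can be approximated by hand. The sketch below rests on two assumptions that are not confirmed by the package: that the probes to keep are exactly those whose ID starts with `dBr_` (as the printed probe IDs suggest), and that trimming keeps positions 1 to 36 of each sequence (mirroring `probeTrim: 1 36`). The probe IDs and sequences in the toy tibble are made up.

```r
library(tibble)
library(dplyr)

# Toy stand-in for broom::tidy(se): probe IDs and 40-nt sequences are made up.
# "#N/A" marks control probes without a sequence, as in the output above.
se_tbl <- tibble(
  probeID = c("ctrl_1", "dBr_1", "dBr_2"),
  Sequence = c("#N/A",
               paste0(strrep("ACGT", 9), "GGGG"),
               paste0(strrep("TTGA", 9), "CCCC")),
  fore = c(4446, 4307, 8101)
)

# Assumption: the probes of interest are those whose ID starts with "dBr_"
filtered <- se_tbl %>%
  filter(grepl("^dBr_", probeID)) %>%
  # Assumption: trimming keeps positions 1 to 36 of each sequence
  mutate(Sequence = substr(Sequence, 1, 36))

filtered
```

When the original PBMExperiment object is at hand, calling broom::tidy on it directly is preferable, since the filtering and trimming rules then come from the object itself.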