vignettes/aggregation.Rmd
This vignette uses the `phenoptrExamples` sample data and functions from the tidyverse to demonstrate reading and processing cell seg data from multiple fields and samples.
Use `list_cell_seg_files` and `purrr::map_df` to read all cell seg data files in a single directory into a single `data_frame`. The result is similar to reading an inForm merge table.

`list_cell_seg_files` takes a directory path as an argument and returns a list of paths to all the `cell_seg_data.txt` files in that directory.
library(phenoptr)
library(tidyverse)
base_path <- system.file("extdata", "samples", package = "phenoptrExamples")
paths <- list_cell_seg_files(base_path)
length(paths)
## [1] 9
paths[1]
## [1] "C:/Program Files/R/Library/phenoptrExamples/extdata/samples/Set12_20-6plex_[14146,53503]_cell_seg_data.txt"
`purrr::map_df` applies `read_cell_seg_data` to each path in `paths`. The `data_frame`s returned from each call to `read_cell_seg_data` are combined row-wise to create a single merged `data_frame`. The result is similar to an inForm merge data file.
csd <- purrr::map_df(paths, read_cell_seg_data)
dim(csd)
## [1] 54525 199
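As a rough illustration of the row-wise combination, the sketch below writes two tiny tab-delimited files to a temporary directory and merges them. The file names, the two-column contents, and the use of `read.delim` in place of `read_cell_seg_data` are assumptions made so the example runs without phenoptr. It uses `purrr::map` followed by `dplyr::bind_rows`, which has the same effect here as `map_df`.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in data: two very small "fields" with only two columns
tmp <- tempfile("fields")
dir.create(tmp)
field_a <- data.frame(`Sample Name` = "A_[1,1].im3",
                      Phenotype = c("CK+", "CD8+"),
                      check.names = FALSE)
field_b <- data.frame(`Sample Name` = "A_[2,2].im3",
                      Phenotype = c("CK+", "CK+", "other"),
                      check.names = FALSE)
write.table(field_a, file.path(tmp, "field_a_cell_seg_data.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(field_b, file.path(tmp, "field_b_cell_seg_data.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# Find the files, read each one, and stack the results row-wise
paths <- list.files(tmp, pattern = "cell_seg_data\\.txt$", full.names = TRUE)
merged <- paths %>%
  map(~ read.delim(.x, check.names = FALSE, stringsAsFactors = FALSE)) %>%
  bind_rows()

dim(merged)
```

The merged table has one row per cell across both files, with the `Sample Name` column identifying the field each row came from.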
Using `table` is one way to summarize the data by `Sample Name` or `Slide ID`. These data come from nine fields taken from three slides.
table(csd$`Sample Name`, csd$Phenotype) %>% addmargins(2, list(Total=sum))
##
## CD68+ CD8+ CK+ FoxP3+ other Total
## Set12_20-6plex_[14146,53503].im3 557 101 2323 305 2659 5945
## Set12_20-6plex_[15491,58698].im3 87 123 2367 105 2583 5265
## Set12_20-6plex_[17241,54367].im3 344 79 3799 174 1709 6105
## Set4_1-6plex_[11472,51360].im3 599 497 1731 512 4521 7860
## Set4_1-6plex_[15206,60541].im3 112 349 1434 175 3109 5179
## Set4_1-6plex_[16142,55840].im3 417 228 2257 228 2942 6072
## Set8_11-6plex_[13394,50883].im3 590 58 2491 266 2088 5493
## Set8_11-6plex_[14996,59221].im3 199 197 1100 166 3461 5123
## Set8_11-6plex_[17130,56449].im3 469 259 1861 611 4283 7483
table(csd$`Slide ID`, csd$Phenotype) %>% addmargins(2, list(Total=sum))
##
## CD68+ CD8+ CK+ FoxP3+ other Total
## Set12_20-6plex 988 303 8489 584 6951 17315
## Set4_1-6plex 1128 1074 5422 915 10572 19111
## Set8_11-6plex 1258 514 5452 1043 9832 18099
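The effect of `addmargins(2, list(Total=sum))` can be seen on a small example. The data frame below is made up for illustration and only mimics the shape of the real data.

```r
# Hypothetical toy data: five cells on two slides
cells <- data.frame(
  slide = c("S1", "S1", "S2", "S2", "S2"),
  phenotype = c("CK+", "CD8+", "CK+", "CK+", "CD8+")
)

# Cross-tabulate slide by phenotype
tab <- table(cells$slide, cells$phenotype)

# Margin 2 appends a column; the list name becomes the column label
with_total <- addmargins(tab, 2, list(Total = sum))
with_total
```

Each row gains a `Total` entry holding the number of cells on that slide, while the per-phenotype counts are unchanged.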
Merged data may come from an inForm merge or from combining individual files as shown above. In either case, the data includes entries which come from multiple fields. Computing on merged data requires a few new techniques.
Use `dplyr::group_by` and `dplyr::summarize` to compute summary statistics for all fields in a slide. For finer grouping, use multiple arguments to `group_by`. Use `dplyr::filter` to select a particular phenotype or tissue category.

This example computes the mean PDL1 expression for `CD68+` and `CK+` cells in `Tumor`, with the mean computed per `Slide ID`.
csd %>%
  filter(`Tissue Category`=='Tumor', Phenotype %in% c('CD68+', 'CK+')) %>%
  group_by(`Slide ID`, Phenotype) %>%
  summarize(Mean_PDL1=mean(`Entire Cell PDL1 (Opal 520) Mean`))
## # A tibble: 6 x 3
## # Groups: Slide ID [?]
## `Slide ID` Phenotype Mean_PDL1
## <chr> <chr> <dbl>
## 1 Set12_20-6plex CD68+ 4.59
## 2 Set12_20-6plex CK+ 0.821
## 3 Set4_1-6plex CD68+ 5.32
## 4 Set4_1-6plex CK+ 3.03
## 5 Set8_11-6plex CD68+ 4.01
## 6 Set8_11-6plex CK+ 1.12
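The same filter, group, and summarize pattern can be tried on a hand-made table. The values and the short `PDL1` column name below are invented for illustration; only the column names `Slide ID`, `Tissue Category` and `Phenotype` follow the real data.

```r
library(dplyr)

# Hypothetical toy data: five cells on two slides
cells <- tibble::tibble(
  `Slide ID` = c("S1", "S1", "S1", "S2", "S2"),
  `Tissue Category` = c("Tumor", "Tumor", "Stroma", "Tumor", "Tumor"),
  Phenotype = c("CK+", "CK+", "CK+", "CD68+", "CK+"),
  PDL1 = c(1, 3, 10, 4, 2)
)

means <- cells %>%
  # Keep only tumor-region cells of the phenotypes of interest
  filter(`Tissue Category` == "Tumor", Phenotype %in% c("CD68+", "CK+")) %>%
  # One output row per slide and phenotype
  group_by(`Slide ID`, Phenotype) %>%
  summarize(Mean_PDL1 = mean(PDL1), .groups = "drop")
means
```

The stroma cell with the large value is excluded by the filter, so it does not affect the `S1` mean.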
Nearest-neighbor distances must be computed per-sample because the X/Y coordinates reported in cell seg data files are all relative to the top-left of the sample.
Use `dplyr::group_by` to aggregate across subsets of a full data set. In this case, we want to group by `Sample Name`. Within each group, use `dplyr::do` to call `find_nearest_distance` to compute the distance columns and `dplyr::bind_cols` to combine them with the original data.
# Use the same list of phenotypes for each sample
phenos <- unique(csd$Phenotype)
csd <- csd %>%
  group_by(`Sample Name`) %>%
  do(bind_cols(., find_nearest_distance(., phenos)))
dim(csd)
tail(names(csd), 5)
## [1] 54525 205
## [1] "Distance to other" "Distance to CD68+" "Distance to FoxP3+"
## [4] "Distance to CK+" "Distance to CD8+"
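A small sketch may help show why the grouping matters. It uses a hypothetical helper, `nearest_other`, built on base R's `dist`. It is not phenoptr's `find_nearest_distance` and ignores phenotypes entirely. The coordinates are made up, with each sample using its own origin.

```r
library(dplyr)

# Hypothetical helper: distance from each cell to its nearest other cell
nearest_other <- function(x, y) {
  d <- as.matrix(dist(cbind(x, y)))
  diag(d) <- Inf  # a cell is not its own neighbor
  unname(apply(d, 1, min))
}

# Two samples whose coordinates are each relative to their own top-left corner
cells <- tibble::tibble(
  sample = c("A", "A", "B", "B"),
  x = c(0, 3, 0, 10),
  y = c(0, 4, 0, 0)
)

# Correct: distances are computed within each sample
per_sample <- cells %>%
  group_by(sample) %>%
  mutate(nearest = nearest_other(x, y)) %>%
  ungroup()

# Misleading: pooling all cells treats unrelated coordinates as comparable
pooled <- nearest_other(cells$x, cells$y)

per_sample$nearest
pooled
```

In the pooled calculation the first cell of each sample sits at the same coordinates, so the two appear to be zero distance apart even though they come from different images.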
The next example uses `group_by`, `filter` and `summarize` again to compute the average distance from a tumor cell (`CK+`) to the nearest macrophage (`CD68+`), with the averages computed per `Slide ID`.
csd %>% group_by(`Slide ID`) %>%
  filter(Phenotype=='CK+') %>% # Only tumor cells
  summarize(mean_dist_to_CD68=round(mean(`Distance to CD68+`), 2))
## # A tibble: 3 x 2
## `Slide ID` mean_dist_to_CD68
## <chr> <dbl>
## 1 Set12_20-6plex 56.2
## 2 Set4_1-6plex 41.8
## 3 Set8_11-6plex 41.0
`count_within` for each field in merged data

`count_within` is another function that must be computed per field.

This example uses `dplyr::group_by` and `dplyr::do` to call `count_within` for each field in a merged data file. The result is a `data_frame` with one row per radius per field.

Including `Slide ID` in the `group_by` arguments does not change the grouping; it causes `Slide ID` to be included in the result. This is helpful for further aggregation.

See the section Aggregate counts and means per slide below for an example which aggregates counts per slide.
csd %>% group_by(`Slide ID`, `Sample Name`) %>%
  do(count_within(., from='CK+', to='CD68+', radius=15))
## # A tibble: 9 x 7
## # Groups: Slide ID, Sample Name [9]
## `Slide ID` `Sample Name` radius from_count
## <chr> <chr> <dbl> <int>
## 1 Set12_20-6plex Set12_20-6plex_[14146,53503].im3 15 2323
## 2 Set12_20-6plex Set12_20-6plex_[15491,58698].im3 15 2367
## 3 Set12_20-6plex Set12_20-6plex_[17241,54367].im3 15 3799
## 4 Set4_1-6plex Set4_1-6plex_[11472,51360].im3 15 1731
## 5 Set4_1-6plex Set4_1-6plex_[15206,60541].im3 15 1434
## 6 Set4_1-6plex Set4_1-6plex_[16142,55840].im3 15 2257
## 7 Set8_11-6plex Set8_11-6plex_[13394,50883].im3 15 2491
## 8 Set8_11-6plex Set8_11-6plex_[14996,59221].im3 15 1100
## 9 Set8_11-6plex Set8_11-6plex_[17130,56449].im3 15 1861
## to_count from_with within_mean
## <int> <int> <dbl>
## 1 557 273 0.169
## 2 87 104 0.0503
## 3 344 166 0.0598
## 4 599 387 0.308
## 5 112 67 0.0530
## 6 417 253 0.158
## 7 590 261 0.161
## 8 199 90 0.0918
## 9 469 181 0.134
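To make the output columns concrete, the base R sketch below computes the same kinds of quantities for a handful of made-up points. `count_within_sketch` is a hypothetical helper written for this illustration, not phenoptr's implementation. Treating a distance exactly equal to the radius as "within" is an assumption.

```r
# Hypothetical helper: count `to` cells within `radius` of each `from` cell.
# from_xy and to_xy are two-column matrices of x and y coordinates.
count_within_sketch <- function(from_xy, to_xy, radius) {
  # Pairwise distances: rows are `from` cells, columns are `to` cells
  dx <- outer(from_xy[, 1], to_xy[, 1], "-")
  dy <- outer(from_xy[, 2], to_xy[, 2], "-")
  n_within <- rowSums(sqrt(dx^2 + dy^2) <= radius)
  data.frame(
    radius = radius,
    from_count = nrow(from_xy),        # number of `from` cells
    to_count = nrow(to_xy),            # number of `to` cells
    from_with = sum(n_within > 0),     # `from` cells with at least one `to` nearby
    within_mean = mean(n_within)       # mean number of `to` cells per `from` cell
  )
}

from_xy <- cbind(x = c(0, 100, 200), y = c(0, 0, 0))
to_xy <- cbind(x = c(5, 10, 105), y = c(0, 0, 0))
res <- count_within_sketch(from_xy, to_xy, radius = 15)
res
```

Here the first `from` cell has two `to` cells nearby, the second has one and the third has none, so `from_with` is 2 while `within_mean` is 1. This shows why `from_count * within_mean` can exceed `from_with`.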
`count_within` across samples

Use `count_within_batch` to count cells within a radius for multiple tissue categories, phenotypes and fields when the fields have not been merged.

This example counts `CK+` cells having a `CD8+` cell within 10 or 25 microns, and `CK+` cells with a `CD68+` cell within 10 or 25 microns. `dplyr::glimpse` gives a compact summary of the data.
base_path <- system.file("extdata", "samples", package = "phenoptrExamples")
pairs <- list(c('CK+', 'CD8+'),
              c('CK+', 'CD68+'))
radius <- c(10, 25)
counts <- count_within_batch(base_path, pairs, radius, verbose=FALSE) %>%
  select(-source, -category) # Remove unneeded columns
glimpse(counts)
## Observations: 36
## Variables: 8
## $ slide_id <chr> "Set12_20-6plex", "Set12_20-6plex", "Set12_20-6ple...
## $ from <chr> "CK+", "CK+", "CK+", "CK+", "CK+", "CK+", "CK+", "...
## $ to <chr> "CD8+", "CD8+", "CD68+", "CD68+", "CD8+", "CD8+", ...
## $ radius <int> 10, 25, 10, 25, 10, 25, 10, 25, 10, 25, 10, 25, 10...
## $ from_count <int> 2323, 2323, 2323, 2323, 2367, 2367, 2367, 2367, 37...
## $ to_count <int> 101, 101, 557, 557, 123, 123, 87, 87, 79, 79, 344,...
## $ from_with <int> 33, 250, 99, 647, 71, 528, 47, 301, 36, 284, 54, 5...
## $ within_mean <dbl> 0.015927680, 0.136461472, 0.049504950, 0.613861386...
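Aggregate counts and means per slide

Aggregating `from_count`, `to_count` and `from_with` across fields is straightforward; it only requires simple sums. Aggregating `within_mean` requires recovering the underlying count of cells within the radius for each field (`from_count * within_mean`), summing those counts, and dividing by the summed `from_count` to get a new mean.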
(Note: the value of `from_count * within_mean` is not reported by `count_within` because it may count cells multiple times.)
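counts_per_sample <- counts %>% group_by(slide_id, from, to, radius) %>%
  # summarize() evaluates its arguments in order, and later arguments see
  # earlier results. Compute `within` from the per-field values first,
  # before from_count is replaced by its sum.
  summarize(within=sum(from_count*within_mean),
            from_count=sum(from_count),
            to_count=sum(to_count),
            from_with=sum(from_with),
            within_mean=within/from_count) %>%
  ungroup() %>% select(-within)
counts_per_sample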
## # A tibble: 12 x 8
## slide_id from to radius from_count to_count from_with
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Set12_20-6plex CK+ CD68+ 10 8489 988 200
## 2 Set12_20-6plex CK+ CD68+ 25 8489 988 1482
## 3 Set12_20-6plex CK+ CD8+ 10 8489 303 140
## 4 Set12_20-6plex CK+ CD8+ 25 8489 303 1062
## 5 Set4_1-6plex CK+ CD68+ 10 5422 1128 288
## 6 Set4_1-6plex CK+ CD68+ 25 5422 1128 1775
## 7 Set4_1-6plex CK+ CD8+ 10 5422 1074 344
## 8 Set4_1-6plex CK+ CD8+ 25 5422 1074 2003
## 9 Set8_11-6plex CK+ CD68+ 10 5452 1258 170
## 10 Set8_11-6plex CK+ CD68+ 25 5452 1258 1484
## 11 Set8_11-6plex CK+ CD8+ 10 5452 514 123
## 12 Set8_11-6plex CK+ CD8+ 25 5452 514 832
## within_mean
## <dbl>
## 1 0.0872
## 2 1.03
## 3 0.0590
## 4 0.498
## 5 0.173
## 6 1.84
## 7 0.239
## 8 2.21
## 9 0.0934
## 10 1.51
## 11 0.0980
## 12 0.871
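The arithmetic behind aggregating `within_mean` can be checked by hand on a tiny example. The two rows below are invented values for two fields of one hypothetical slide. The sketch contrasts the count-weighted mean with a plain average of the per-field means.

```r
library(dplyr)

# Hypothetical per-field results for one slide
counts_toy <- tibble::tibble(
  slide_id = c("S1", "S1"),
  from_count = c(100, 300),
  within_mean = c(0.5, 0.1)
)

per_slide <- counts_toy %>%
  group_by(slide_id) %>%
  summarize(
    # Recover the underlying counts per field and sum them: 100*0.5 + 300*0.1
    within = sum(from_count * within_mean),
    # Plain average of the field means, shown only for comparison
    unweighted = mean(within_mean),
    # Only now replace from_count with its sum
    from_count = sum(from_count),
    within_mean = within / from_count,
    .groups = "drop"
  )
per_slide
```

The weighted result is 80 / 400 = 0.2, while the plain average of the two field means is 0.3. The difference arises because the second field contributes three times as many `from` cells. The order of the arguments matters: if `from_count` were summed before `within` is computed, the product would use the slide total rather than the per-field counts.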