Aggregating data from multiple fields

This vignette uses the phenoptrExamples sample data and functions from the tidyverse to demonstrate reading and processing cell seg data from multiple fields and samples.

Read multiple data files

Use list_cell_seg_files and purrr::map_df to read all cell seg data files in a single directory into a single data_frame. The result is similar to reading an inForm merge table.

Find all cell seg data files in a directory

list_cell_seg_files takes a directory path as an argument and returns a list of paths to all the cell_seg_data.txt files in that directory.

library(phenoptr)
library(tidyverse)
base_path <- system.file("extdata", "samples", package = "phenoptrExamples")
paths <- list_cell_seg_files(base_path)
length(paths)

## [1] 9

paths[1]

## [1] "C:/Program Files/R/Library/phenoptrExamples/extdata/samples/Set12_20-6plex_[14146,53503]_cell_seg_data.txt"
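For intuition, a roughly comparable listing can be sketched with base R's list.files. This is an illustration only, not how list_cell_seg_files is implemented; the temporary directory and the file names are made up for the example.

```r
# Illustration only: a made-up directory with two cell seg files and one other file
demo_dir <- tempfile("fields")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("A_[1,2]_cell_seg_data.txt",
                                            "B_[3,4]_cell_seg_data.txt",
                                            "A_[1,2]_cell_seg_data_summary.txt"))))

# Keep only files whose names end in "cell_seg_data.txt"
demo_paths <- list.files(demo_dir, pattern = "cell_seg_data\\.txt$",
                         full.names = TRUE)
basename(demo_paths)
```

The summary file is skipped because the pattern is anchored to the end of the file name.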

Read and combine files

purrr::map_df applies read_cell_seg_data to each path in paths. The data_frames returned from each call to read_cell_seg_data are combined row-wise to create a single merged data_frame. The result is similar to an inForm merge data file.

csd <- purrr::map_df(paths, read_cell_seg_data)
dim(csd)

## [1] 54525   199
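Conceptually, this is "read each file, then stack the results row-wise". The sketch below shows the same pattern in base R on two tiny made-up files. The file contents and the read_one helper are assumptions for illustration; real cell seg files have many more columns and should be read with read_cell_seg_data.

```r
# Two small made-up field tables standing in for cell seg data files
field1 <- data.frame(`Sample Name` = "S1_[1,1].im3",
                     Phenotype = c("CK+", "CD8+"),
                     check.names = FALSE, stringsAsFactors = FALSE)
field2 <- data.frame(`Sample Name` = "S1_[2,2].im3",
                     Phenotype = c("CK+", "CK+", "other"),
                     check.names = FALSE, stringsAsFactors = FALSE)

# Write them as tab-delimited text files in a temporary location
tmp <- tempfile(c("f1_", "f2_"), fileext = ".txt")
write.table(field1, tmp[1], sep = "\t", row.names = FALSE, quote = FALSE)
write.table(field2, tmp[2], sep = "\t", row.names = FALSE, quote = FALSE)

# Hypothetical reader for the toy files (not read_cell_seg_data)
read_one <- function(path) {
  read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)
}

# Read each file, then combine the data frames row-wise
merged <- do.call(rbind, lapply(tmp, read_one))
dim(merged)
```

The merged table keeps the Sample Name column from each file, which is what makes per-field grouping possible later.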

Using table is one way to summarize the data by Sample Name or Slide ID. This data comes from nine fields taken from three slides.

table(csd$`Sample Name`, csd$Phenotype) %>% addmargins(2, list(Total=sum))

##                                   
##                                    CD68+ CD8+  CK+ FoxP3+ other Total
##   Set12_20-6plex_[14146,53503].im3   557  101 2323    305  2659  5945
##   Set12_20-6plex_[15491,58698].im3    87  123 2367    105  2583  5265
##   Set12_20-6plex_[17241,54367].im3   344   79 3799    174  1709  6105
##   Set4_1-6plex_[11472,51360].im3     599  497 1731    512  4521  7860
##   Set4_1-6plex_[15206,60541].im3     112  349 1434    175  3109  5179
##   Set4_1-6plex_[16142,55840].im3     417  228 2257    228  2942  6072
##   Set8_11-6plex_[13394,50883].im3    590   58 2491    266  2088  5493
##   Set8_11-6plex_[14996,59221].im3    199  197 1100    166  3461  5123
##   Set8_11-6plex_[17130,56449].im3    469  259 1861    611  4283  7483

table(csd$`Slide ID`, csd$Phenotype) %>% addmargins(2, list(Total=sum))

##                 
##                  CD68+  CD8+   CK+ FoxP3+ other Total
##   Set12_20-6plex   988   303  8489    584  6951 17315
##   Set4_1-6plex    1128  1074  5422    915 10572 19111
##   Set8_11-6plex   1258   514  5452   1043  9832 18099

Compute on merged data

Merged data may come from an inForm merge or from combining individual files as shown above. In either case, the data includes entries which come from multiple fields. Computing on merged data requires a few new techniques.

Summarize per slide

Use dplyr::group_by and dplyr::summarize to compute summary statistics for all fields in a slide. For finer grouping, use multiple arguments to group_by. Use dplyr::filter to select a particular phenotype or tissue category.

This example computes the mean PDL1 expression for CD68+ and CK+ cells in Tumor, with the mean computed per Slide ID.

csd %>% 
  filter(`Tissue Category`=='Tumor', Phenotype %in% c('CD68+', 'CK+')) %>% 
  group_by(`Slide ID`, Phenotype) %>% 
  summarize(Mean_PDL1=mean(`Entire Cell PDL1 (Opal 520) Mean`))

## # A tibble: 6 x 3
## # Groups:   Slide ID [?]
##   `Slide ID`     Phenotype Mean_PDL1
##   <chr>          <chr>         <dbl>
## 1 Set12_20-6plex CD68+         4.59 
## 2 Set12_20-6plex CK+           0.821
## 3 Set4_1-6plex   CD68+         5.32 
## 4 Set4_1-6plex   CK+           3.03 
## 5 Set8_11-6plex  CD68+         4.01 
## 6 Set8_11-6plex  CK+           1.12
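As a small illustration of finer grouping, the sketch below groups a made-up table by both Slide ID and Tissue Category. The data and the PDL1 column name are invented for the example, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Made-up cells; real data would come from read_cell_seg_data
toy <- data.frame(
  `Slide ID` = c("A", "A", "A", "B", "B", "B"),
  `Tissue Category` = c("Tumor", "Tumor", "Stroma", "Tumor", "Tumor", "Stroma"),
  Phenotype = c("CK+", "CK+", "CK+", "CK+", "CD68+", "CK+"),
  PDL1 = c(1, 3, 10, 2, 6, 20),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Mean expression of CK+ cells, one row per slide and tissue category
toy_means <- toy %>%
  filter(Phenotype == "CK+") %>%
  group_by(`Slide ID`, `Tissue Category`) %>%
  summarize(Mean_PDL1 = mean(PDL1), .groups = "drop")
toy_means
```

Each extra argument to group_by adds a key column to the result, so this returns four rows rather than two.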

Add distance columns to merged data

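Nearest-neighbor distances must be computed per sample (that is, per field, as identified by Sample Name) because the X/Y coordinates reported in cell seg data files are all relative to the top-left of the sample. Coordinates from different fields are therefore not directly comparable.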

Use dplyr::group_by to aggregate across subsets of a full data set. In this case, we want to group by Sample Name. Within each group, use dplyr::do to call find_nearest_distance to compute the distance columns and dplyr::bind_cols to combine them with the original data.

# Use the same list of phenotypes for each sample
phenos <- unique(csd$Phenotype)
csd <- csd %>%
  group_by(`Sample Name`) %>%
  do(bind_cols(., find_nearest_distance(., phenos)))
dim(csd)
tail(names(csd), 5)

## [1] 54525   205

## [1] "Distance to other"  "Distance to CD68+"  "Distance to FoxP3+"
## [4] "Distance to CK+"    "Distance to CD8+"
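To see why the grouping matters, consider two made-up fields whose coordinates both start at (0, 0). The sketch below uses a hypothetical nearest_cd68 helper written in base R, not find_nearest_distance, to compare pooled and per-field results.

```r
# Two made-up fields; coordinates restart at (0, 0) in each field
cells <- data.frame(
  field = c("F1", "F1", "F2", "F2"),
  pheno = c("CK+", "CD68+", "CK+", "CD68+"),
  x = c(0, 30, 0, 2),
  y = c(0, 40, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: distance from each CK+ cell to the nearest CD68+ cell
nearest_cd68 <- function(d) {
  from <- d[d$pheno == "CK+", ]
  to <- d[d$pheno == "CD68+", ]
  sapply(seq_len(nrow(from)), function(i) {
    min(sqrt((to$x - from$x[i])^2 + (to$y - from$y[i])^2))
  })
}

# Pooling the fields lets the F1 tumor cell "see" the F2 macrophage
pooled <- nearest_cd68(cells)

# Computing within each field keeps the coordinate systems separate
per_field <- unlist(lapply(split(cells, cells$field), nearest_cd68))

pooled
per_field
```

In the pooled calculation both tumor cells appear to be 2 units from a macrophage, while the per-field calculation reports 50 for the first field and 2 for the second.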

Average distance per sample

The next example uses group_by, filter and summarize again to compute the average distance from a tumor cell (CK+) to the nearest macrophage (CD68+), with the averages computed per Slide ID.

csd %>% group_by(`Slide ID`) %>% 
  filter(Phenotype=='CK+') %>% # Only tumor cells
  summarize(mean_dist_to_CD68=round(mean(`Distance to CD68+`), 2))

## # A tibble: 3 x 2
##   `Slide ID`     mean_dist_to_CD68
##   <chr>                      <dbl>
## 1 Set12_20-6plex              56.2
## 2 Set4_1-6plex                41.8
## 3 Set8_11-6plex               41.0

Compute `count_within` for each field in merged data

count_within is another function that must be computed per field.

This example uses dplyr::group_by and dplyr::do to call count_within for each field in a merged data file. The result is a data_frame with one row per radius per field.

Including Slide ID in the group_by arguments doesn’t change the grouping; it only causes Slide ID to be included in the result. This is helpful for further aggregation.

See the section Aggregate counts and means per slide below for an example which aggregates counts per slide.

csd %>% group_by(`Slide ID`, `Sample Name`) %>% 
  do(count_within(., from='CK+', to='CD68+', radius=15))

## # A tibble: 9 x 7
## # Groups:   Slide ID, Sample Name [9]
##   `Slide ID`     `Sample Name`                    radius from_count
##   <chr>          <chr>                             <dbl>      <int>
## 1 Set12_20-6plex Set12_20-6plex_[14146,53503].im3     15       2323
## 2 Set12_20-6plex Set12_20-6plex_[15491,58698].im3     15       2367
## 3 Set12_20-6plex Set12_20-6plex_[17241,54367].im3     15       3799
## 4 Set4_1-6plex   Set4_1-6plex_[11472,51360].im3       15       1731
## 5 Set4_1-6plex   Set4_1-6plex_[15206,60541].im3       15       1434
## 6 Set4_1-6plex   Set4_1-6plex_[16142,55840].im3       15       2257
## 7 Set8_11-6plex  Set8_11-6plex_[13394,50883].im3      15       2491
## 8 Set8_11-6plex  Set8_11-6plex_[14996,59221].im3      15       1100
## 9 Set8_11-6plex  Set8_11-6plex_[17130,56449].im3      15       1861
##   to_count from_with within_mean
##      <int>     <int>       <dbl>
## 1      557       273      0.169 
## 2       87       104      0.0503
## 3      344       166      0.0598
## 4      599       387      0.308 
## 5      112        67      0.0530
## 6      417       253      0.158 
## 7      590       261      0.161 
## 8      199        90      0.0918
## 9      469       181      0.134
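To make the columns concrete, the following toy re-implementation computes the same kinds of quantities on made-up coordinates. It is a sketch and not the phenoptr code: the toy_count_within helper is hypothetical, and treating a cell at exactly the radius as "within" is an assumption.

```r
# Made-up cells in two fields; coordinates are local to each field
cells <- data.frame(
  field = c("F1", "F1", "F1", "F1", "F1", "F2", "F2"),
  pheno = c("CK+", "CK+", "CK+", "CD68+", "CD68+", "CK+", "CD68+"),
  x = c(0, 10, 100, 5, 12, 0, 50),
  y = c(0, 0, 100, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not the phenoptr implementation
toy_count_within <- function(d, from, to, radius) {
  src <- d[d$pheno == from, ]
  tgt <- d[d$pheno == to, ]
  # For each `from` cell, count the `to` cells no farther than `radius`
  n_near <- vapply(seq_len(nrow(src)), function(i) {
    sum(sqrt((tgt$x - src$x[i])^2 + (tgt$y - src$y[i])^2) <= radius)
  }, integer(1))
  data.frame(radius = radius,
             from_count = nrow(src),
             to_count = nrow(tgt),
             from_with = sum(n_near > 0),  # from cells with at least one neighbor
             within_mean = mean(n_near))   # average neighbors per from cell
}

# One row per field, analogous to group_by followed by do
per_field <- do.call(rbind, lapply(split(cells, cells$field), toy_count_within,
                                   from = "CK+", to = "CD68+", radius = 15))
per_field
```

In field F1 two of the three CK+ cells each have two CD68+ neighbors, so from_with is 2 while within_mean is 4/3. This shows how within_mean can count the same CD68+ cell more than once.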

Aggregate `count_within` across samples

Compute counts and averages

Use count_within_batch to count cells within a radius for multiple tissue categories, phenotypes and fields when the fields have not been merged.

This example counts CK+ cells having a CD8+ cell within 10 or 25 microns, and CK+ cells with a CD68+ cell within 10 or 25 microns. dplyr::glimpse gives a compact summary of the data.

base_path <- system.file("extdata", "samples", package = "phenoptrExamples")
pairs <- list(c('CK+', 'CD8+'),
             c('CK+', 'CD68+'))
radius <- c(10, 25)
counts <- count_within_batch(base_path, pairs, radius, verbose=FALSE) %>% 
  select(-source, -category) # Remove unneeded columns

glimpse(counts)

## Observations: 36
## Variables: 8
## $ slide_id    <chr> "Set12_20-6plex", "Set12_20-6plex", "Set12_20-6ple...
## $ from        <chr> "CK+", "CK+", "CK+", "CK+", "CK+", "CK+", "CK+", "...
## $ to          <chr> "CD8+", "CD8+", "CD68+", "CD68+", "CD8+", "CD8+", ...
## $ radius      <int> 10, 25, 10, 25, 10, 25, 10, 25, 10, 25, 10, 25, 10...
## $ from_count  <int> 2323, 2323, 2323, 2323, 2367, 2367, 2367, 2367, 37...
## $ to_count    <int> 101, 101, 557, 557, 123, 123, 87, 87, 79, 79, 344,...
## $ from_with   <int> 33, 250, 99, 647, 71, 528, 47, 301, 36, 284, 54, 5...
## $ within_mean <dbl> 0.015927680, 0.136461472, 0.049504950, 0.613861386...

Aggregate counts and means per slide

Aggregating from_count, to_count and from_with across fields is straightforward; it only requires simple sums. Aggregating within_mean requires computing the underlying count of cells within the radius for each field, summing those counts, and computing a new mean.

(Note: the value of from_count * within_mean is not reported by count_within because it may count cells multiple times.)

counts_per_sample <- counts %>% group_by(slide_id, from, to, radius) %>% 
    # Compute `within` first: summarize evaluates its arguments in order, so
    # once from_count is reassigned it refers to the per-slide sum.
    summarize(within=sum(from_count*within_mean),
              from_count=sum(from_count),
              to_count=sum(to_count),
              from_with=sum(from_with),
              within_mean=within/from_count) %>%
  ungroup %>% select(-within)
counts_per_sample

## # A tibble: 12 x 8
##    slide_id       from  to    radius from_count to_count from_with
##    <chr>          <chr> <chr>  <int>      <int>    <int>     <int>
##  1 Set12_20-6plex CK+   CD68+     10       8489      988       200
##  2 Set12_20-6plex CK+   CD68+     25       8489      988      1482
##  3 Set12_20-6plex CK+   CD8+      10       8489      303       140
##  4 Set12_20-6plex CK+   CD8+      25       8489      303      1062
##  5 Set4_1-6plex   CK+   CD68+     10       5422     1128       288
##  6 Set4_1-6plex   CK+   CD68+     25       5422     1128      1775
##  7 Set4_1-6plex   CK+   CD8+      10       5422     1074       344
##  8 Set4_1-6plex   CK+   CD8+      25       5422     1074      2003
##  9 Set8_11-6plex  CK+   CD68+     10       5452     1258       170
## 10 Set8_11-6plex  CK+   CD68+     25       5452     1258      1484
## 11 Set8_11-6plex  CK+   CD8+      10       5452      514       123
## 12 Set8_11-6plex  CK+   CD8+      25       5452      514       832
##    within_mean
##          <dbl>
##  1      0.0872
##  2      1.03  
##  3      0.0590
##  4      0.498 
##  5      0.173 
##  6      1.84  
##  7      0.239 
##  8      2.21  
##  9      0.0934
## 10      1.51  
## 11      0.0980
## 12      0.871
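The arithmetic behind aggregating within_mean can be checked on made-up numbers. The sketch below uses two hypothetical fields and shows that the pooled mean is a from_count-weighted mean, which differs from a plain average of the per-field means.

```r
# Two made-up fields with different numbers of `from` cells
fields <- data.frame(
  from_count = c(100, 300),
  within_mean = c(0.50, 0.10)
)

# Recover the underlying totals per field, then pool them
within_total <- sum(fields$from_count * fields$within_mean)  # 50 + 30
pooled_mean <- within_total / sum(fields$from_count)         # 80 / 400

# Averaging the per-field means ignores the field sizes
naive_mean <- mean(fields$within_mean)

pooled_mean
naive_mean
```

Here the pooled mean is 0.2 and the naive mean is 0.3. The pooled value is the same as weighted.mean(fields$within_mean, fields$from_count).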

Kent Johnson

2018-10-30
