Summarising your data in R with across()

06 Jul, 2023

Just want to highlight something I learned from the dplyr R package that has saved me a lot of copy-and-paste coding!

Problem

dplyr is useful because I can use the group_by() function to group a data frame by a specific column, then use summarise() (or mutate(), see below) to apply functions to each of those groups.

For instance, with Sepal.Length in iris:

# I have already loaded dplyr with library(dplyr)
head(iris)

    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1          5.1         3.5          1.4         0.2  setosa
    ## 2          4.9         3.0          1.4         0.2  setosa
    ## 3          4.7         3.2          1.3         0.2  setosa
    ## 4          4.6         3.1          1.5         0.2  setosa
    ## 5          5.0         3.6          1.4         0.2  setosa
    ## 6          5.4         3.9          1.7         0.4  setosa

iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(Sepal.Length.Mean = mean(Sepal.Length))
iris_summary

    ## # A tibble: 3 × 2
    ##   Species    Sepal.Length.Mean
    ##   <fct>                  <dbl>
    ## 1 setosa                  5.01
    ## 2 versicolor              5.94
    ## 3 virginica               6.59

However, I often want to calculate multiple summary statistics (mean, median, nth percentile, standard error, standard deviation, etc.) for multiple columns, and writing each one out by hand gets repetitive quickly.
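
To make the copy-and-paste problem concrete, here is a rough sketch of what that looks like without any helper: just two statistics for two columns already needs four nearly identical lines. The object name iris_manual is only an illustrative choice of mine.

```r
library(dplyr)

# Without across(): every statistic for every column is typed out by hand
iris_manual <- iris %>%
  group_by(Species) %>%
  summarise(
    Sepal.Length.Mean = mean(Sepal.Length),
    Sepal.Length.SD   = sd(Sepal.Length),
    Sepal.Width.Mean  = mean(Sepal.Width),
    Sepal.Width.SD    = sd(Sepal.Width)
  )
iris_manual
```

Adding the petal columns or a third statistic multiplies the number of lines again, which is the repetition the rest of this post tries to avoid.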

Applying transformations across multiple columns

across() makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside “data-masking” functions like summarise() and mutate().

For now let’s work with the first two arguments, .cols and .fns.

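.cols takes a <tidy-select> specification, so you can give it the specific columns you want combined with c(), or the complement of a selection using !. Grouping columns (here Species) are handled by group_by() and are never selected by across(), which is why !Petal.Length below returns only the other three numeric columns.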

iris %>%
  group_by(Species) %>%
  summarise(across(.cols = c(Sepal.Length, Sepal.Width), # combining selections
                   .fns = mean))

    ## # A tibble: 3 × 3
    ##   Species    Sepal.Length Sepal.Width
    ##   <fct>             <dbl>       <dbl>
    ## 1 setosa             5.01        3.43
    ## 2 versicolor         5.94        2.77
    ## 3 virginica          6.59        2.97


iris %>%
  group_by(Species) %>%
  summarise(across(.cols = !Petal.Length, # the complement of a selection
                   .fns = mean))

    ## # A tibble: 3 × 4
    ##   Species    Sepal.Length Sepal.Width Petal.Width
    ##   <fct>             <dbl>       <dbl>       <dbl>
    ## 1 setosa             5.01        3.43       0.246
    ## 2 versicolor         5.94        2.77       1.33
    ## 3 virginica          6.59        2.97       2.03

Selection helpers (selecting based on specific conditions)

You can also use functions called selection helpers, such as starts_with() and contains(). Note that by default, if you give .fns just one function, the resulting columns keep their original names. See below for usage of the .names argument.

iris %>%
  group_by(Species) %>%
  summarise(across(.cols = starts_with("Sepal"), # select columns starting with "Sepal"
                   .fns = mean))

    ## # A tibble: 3 × 3
    ##   Species    Sepal.Length Sepal.Width
    ##   <fct>             <dbl>       <dbl>
    ## 1 setosa             5.01        3.43
    ## 2 versicolor         5.94        2.77
    ## 3 virginica          6.59        2.97

iris %>%
  group_by(Species) %>%
  # ends_with() would also work in this specific case; contains() is more useful
  # for string patterns that sit in the middle of the column name.
  summarise(across(.cols = contains("Length"), # select columns that contain "Length".
                   .fns = mean))

    ## # A tibble: 3 × 3
    ##   Species    Sepal.Length Petal.Length
    ##   <fct>             <dbl>        <dbl>
    ## 1 setosa             5.01         1.46
    ## 2 versicolor         5.94         4.26
    ## 3 virginica          6.59         5.55

Multiple transformations per column

OK, so let’s just say we want to select just the sepal data, and we want to calculate both the mean and standard deviation for those columns. In the previous example we directly passed the mean() function to the .fns argument, but we can actually give it a named list of functions. With a named list the syntax is list(name = value), so in this case we’ll use list(Mean = mean, SD = sd).

iris %>%
  group_by(Species) %>%
  summarise(across(.cols = starts_with("Sepal"), # select columns that start with "Sepal".
                   .fns = list(Mean = mean, SD = sd)))

    ## # A tibble: 3 × 5
    ##   Species    Sepal.Length_Mean Sepal.Length_SD Sepal.Width_Mean Sepal.Width_SD
    ##   <fct>                  <dbl>           <dbl>            <dbl>          <dbl>
    ## 1 setosa                  5.01           0.352             3.43          0.379
    ## 2 versicolor              5.94           0.516             2.77          0.314
    ## 3 virginica               6.59           0.636             2.97          0.322

Notice that across() automatically appends the name you gave each function to the new column names, separated by an underscore.

Defining how the output columns are named

The naming is controlled by the .names argument, which takes a glue specification where {.col} stands for the selected column name and {.fn} for the name of the function being applied. The default (NULL) is equivalent to “{.col}” for the single function case and “{.col}_{.fn}” for the case where a list is used for .fns.

So we can tell it how to handle the function names, which should probably match whatever convention you use for naming columns. I usually use CamelCase, but here the data separates capitalized words with ., so I will match that.


iris %>%
  group_by(Species) %>%
  summarise(across(.cols = starts_with("Sepal"), # select columns that start with "Sepal".
                   .fns = list(Mean = mean, SD = sd),
                   .names = "{.col}.{.fn}"))

    ## # A tibble: 3 × 5
    ##   Species    Sepal.Length.Mean Sepal.Length.SD Sepal.Width.Mean Sepal.Width.SD
    ##   <fct>                  <dbl>           <dbl>            <dbl>          <dbl>
    ## 1 setosa                  5.01           0.352             3.43          0.379
    ## 2 versicolor              5.94           0.516             2.77          0.314
    ## 3 virginica               6.59           0.636             2.97          0.322
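
The two placeholders can go in any order and with any separator. As a small sketch, assuming you would rather have the statistic first, this puts {.fn} in front of {.col}. The object name stat_first is only for illustration.

```r
library(dplyr)

# Same summary as above, but the function name comes before the column name
stat_first <- iris %>%
  group_by(Species) %>%
  summarise(across(.cols = starts_with("Sepal"),
                   .fns = list(Mean = mean, SD = sd),
                   .names = "{.fn}_{.col}"))

names(stat_first)
# "Species" "Mean_Sepal.Length" "SD_Sepal.Length" "Mean_Sepal.Width" "SD_Sepal.Width"
```

Only the names change here; the values are the same as in the previous table.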

This can be useful if you have, say, dozens of columns to perform calculations on, not necessarily summary statistics.

Calculations within the named list

One last thing to mention about the across() function is that you can use purrr-style lambdas in your named list. purrr is a useful package, also in the tidyverse, that specializes in working with functions and vectors, e.g., running the same function on every item in a vector.

The useful part about lambdas is that you can perform mathematical operations within them and also combine multiple functions. Standard error calculations are common in biology, so let’s use this as an example.

In a lambda you first write a ~ character, followed by the formula. The contents of each grouped column are then passed to wherever you have a . character in the formula.

iris %>%
  group_by(Species) %>%
  summarise(across(.cols = starts_with("Sepal"), # select columns that start with "Sepal".
                   .fns = list(Mean = mean,
                               SD = sd,
                               SE = ~ sd(.)/sqrt(length(.))), # length returns the number of items in the groupings of the column, i.e. per species
                   .names = "{.col}.{.fn}"))

    ## # A tibble: 3 × 7
    ##   Species    Sepal.Length.Mean Sepal.Length.SD Sepal.L…¹ Sepal…² Sepal…³ Sepal…⁴
    ##   <fct>                  <dbl>           <dbl>     <dbl>   <dbl>   <dbl>   <dbl>
    ## 1 setosa                  5.01           0.352    0.0498    3.43   0.379  0.0536
    ## 2 versicolor              5.94           0.516    0.0730    2.77   0.314  0.0444
    ## 3 virginica               6.59           0.636    0.0899    2.97   0.322  0.0456
    ## # … with abbreviated variable names ¹Sepal.Length.SE, ²Sepal.Width.Mean,
    ## #   ³Sepal.Width.SD, ⁴Sepal.Width.SE
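
If the ~ syntax feels opaque, the same calculation can be written as an ordinary R function and passed in by name. The sketch below assumes a helper I have called std_error (a hypothetical name, not part of dplyr) and compares it with the lambda form, this time using .x, which refers to the same thing as . inside the lambda.

```r
library(dplyr)

# Hypothetical helper: standard error of the mean for a numeric vector
std_error <- function(x) sd(x) / sqrt(length(x))

# Passing the named helper
se_named <- iris %>%
  group_by(Species) %>%
  summarise(across(.cols = starts_with("Sepal"),
                   .fns = list(SE = std_error),
                   .names = "{.col}.{.fn}"))

# Passing the equivalent purrr-style lambda
se_lambda <- iris %>%
  group_by(Species) %>%
  summarise(across(.cols = starts_with("Sepal"),
                   .fns = list(SE = ~ sd(.x) / sqrt(length(.x))),
                   .names = "{.col}.{.fn}"))

# Both approaches should give the same numbers
all.equal(se_named$Sepal.Length.SE, se_lambda$Sepal.Length.SE)
```

A named helper is easier to reuse and test on its own, while the lambda keeps short one-off calculations next to the code that uses them.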

Keeping the original columns

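Finally, you don’t always want just the results of the across() function. If you want to keep the original columns, you can use mutate() instead of summarise(). The data frame keeps all of its rows, and each group’s value is repeated on every row of that group.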

iris %>%
  group_by(Species) %>%
  mutate(across(.cols = starts_with("Sepal"), # select columns that start with "Sepal".
                .fns = list(Mean = mean,
                            SD = sd,
                            SE = ~ sd(.)/sqrt(length(.))), # length returns the number of items in the groupings of the column, i.e. per species
                .names = "{.col}.{.fn}"))

    ## # A tibble: 150 × 11
    ## # Groups:   Species [3]
    ##    Sepal.Length Sepal.…¹ Petal…² Petal…³ Species Sepal…⁴ Sepal…⁵ Sepal…⁶ Sepal…⁷
    ##           <dbl>    <dbl>   <dbl>   <dbl> <fct>     <dbl>   <dbl>   <dbl>   <dbl>
    ##  1          5.1      3.5     1.4     0.2 setosa     5.01   0.352  0.0498    3.43
    ##  2          4.9      3       1.4     0.2 setosa     5.01   0.352  0.0498    3.43
    ##  3          4.7      3.2     1.3     0.2 setosa     5.01   0.352  0.0498    3.43
    ##  4          4.6      3.1     1.5     0.2 setosa     5.01   0.352  0.0498    3.43
    ##  5          5        3.6     1.4     0.2 setosa     5.01   0.352  0.0498    3.43
    ##  6          5.4      3.9     1.7     0.4 setosa     5.01   0.352  0.0498    3.43
    ##  7          4.6      3.4     1.4     0.3 setosa     5.01   0.352  0.0498    3.43
    ##  8          5        3.4     1.5     0.2 setosa     5.01   0.352  0.0498    3.43
    ##  9          4.4      2.9     1.4     0.2 setosa     5.01   0.352  0.0498    3.43
    ## 10          4.9      3.1     1.5     0.1 setosa     5.01   0.352  0.0498    3.43
    ## # … with 140 more rows, 2 more variables: Sepal.Width.SD <dbl>,
    ## #   Sepal.Width.SE <dbl>, and abbreviated variable names ¹Sepal.Width,
    ## #   ²Petal.Length, ³Petal.Width, ⁴Sepal.Length.Mean, ⁵Sepal.Length.SD,
    ## #   ⁶Sepal.Length.SE, ⁷Sepal.Width.Mean
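
mutate() with across() is also where calculations that are not summary statistics fit naturally. As a rough sketch, assuming you want each sepal measurement expressed relative to its species mean, the lambda below subtracts the group mean from every value. The names Centered and iris_centered are my own illustrative choices.

```r
library(dplyr)

iris_centered <- iris %>%
  group_by(Species) %>%
  # One new column per selected column; the originals are kept
  mutate(across(.cols = starts_with("Sepal"),
                .fns = list(Centered = ~ .x - mean(.x)),
                .names = "{.col}.{.fn}")) %>%
  ungroup() # drop the grouping so later steps work on the whole table

iris_centered
```

The ungroup() at the end is optional, but without it the result stays grouped by Species, as the "Groups" line in the output above shows.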
