Summarising your data in R with across()
Just want to highlight something I learned from the dplyr
R package that
has saved me a lot of copy-and-paste coding!
Problem
dplyr is useful because I can use the group_by()
function to group a dataframe by a specific column, then use summarise()
(or mutate()
, see below) to iterate each of those groupings to perform specific functions.
For instance, with Sepal.Length
in iris
:
# I have already loaded dplyr with library(dplyr)
head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris_summary <- iris %>%
group_by(Species) %>%
summarise(Sepal.Length.Mean = mean(Sepal.Length))
iris_summary
## # A tibble: 3 × 2
## Species Sepal.Length.Mean
## <fct> <dbl>
## 1 setosa 5.01
## 2 versicolor 5.94
## 3 virginica 6.59
However, often times I want to calculate multiple summary statistics (mean, median, nth percentile, standard error, standard deviation, etc.) for multiple columns.
Applying transformations across multiple columns
across() makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in “data-masking” functions like summarise() and mutate().
For now let’s work with the first two arguments, .cols
and .fns
.
.cols can take <tidy-select>
columns, so you can give it a specific vector of strings matching the columns you want, or its complement using !
.
iris %>%
group_by(Species) %>%
summarise(across(.cols = c(Sepal.Length, Sepal.Width), # combining selections
.fns = mean))
## # A tibble: 3 × 3
## Species Sepal.Length Sepal.Width
## <fct> <dbl> <dbl>
## 1 setosa 5.01 3.43
## 2 versicolor 5.94 2.77
## 3 virginica 6.59 2.97
iris %>%
group_by(Species) %>%
summarise(across(.cols = !Petal.Length, # the complement of a selection
.fns = mean))
## # A tibble: 3 × 4
## Species Sepal.Length Sepal.Width Petal.Width
## <fct> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 0.246
## 2 versicolor 5.94 2.77 1.33
## 3 virginica 6.59 2.97 2.03
Selection helpers (selecting based on specific conditions)
You can also use functions called selection helpers, such as
starts_with()
and contains()
. Note that by default if you give .fns just
one function, the resulting column retains the same name. See below for
usage of the .names
argument.
iris %>%
group_by(Species) %>%
summarise(across(.cols = starts_with("Sepal"), # select columns starting with "Sepal"
.fns = mean))
## # A tibble: 3 × 3
## Species Sepal.Length Sepal.Width
## <fct> <dbl> <dbl>
## 1 setosa 5.01 3.43
## 2 versicolor 5.94 2.77
## 3 virginica 6.59 2.97
iris %>%
group_by(Species) %>%
# ends_with() would also work in this specific case, so contain() might be more appropriate for string patterns that are in the middle of the column name.
summarise(across(.cols = contains("Length"), # select columns that contain "Length".
.fns = mean))
## # A tibble: 3 × 3
## Species Sepal.Length Petal.Length
## <fct> <dbl> <dbl>
## 1 setosa 5.01 1.46
## 2 versicolor 5.94 4.26
## 3 virginica 6.59 5.55
Multiple transformations per columns
OK, so let’s just say we want to select just the sepal data, and we want
to calculate both the mean and standard deviation for those columns. In
the previous example we directly passed the mean()
function to the .fns
argument, but we can actually give it a named list of functions. With
a named list the syntax is list(name = value)
, so in this case we’ll use
list(Mean = mean, SD = sd)
.
iris %>%
group_by(Species) %>%
summarise(across(.cols = starts_with("Sepal"), # select columns that start with "Sepal".
.fns = list(Mean = mean, SD = sd)))
## # A tibble: 3 × 5
## Species Sepal.Length_Mean Sepal.Length_SD Sepal.Width_Mean Sepal.Width_SD
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 0.352 3.43 0.379
## 2 versicolor 5.94 0.516 2.77 0.314
## 3 virginica 6.59 0.636 2.97 0.322
We’ll see that it automatically appends the name you gave the function to the new columns produced, separated by an underscore.
Defining how the output columns are named
.names The default (NULL) is equivalent to “{.col}” for the single function case and “{.col}_{.fn}” for the case where a list is used for .fns.
So we can tell it how to hand the function names, which should probably
match whatever convention you use for naming columns. I usually use
CamelCase, but here the data separates capitalized words with .
, so I
will match that.
iris %>%
group_by(Species) %>%
summarise(across(.cols = starts_with("Sepal"), # select columns that start with "Sepal".
.fns = list(Mean = mean, SD = sd),
.names = "{.col}.{.fn}"))
## # A tibble: 3 × 5
## Species Sepal.Length.Mean Sepal.Length.SD Sepal.Width.Mean Sepal.Width.SD
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 0.352 3.43 0.379
## 2 versicolor 5.94 0.516 2.77 0.314
## 3 virginica 6.59 0.636 2.97 0.322
This can be useful you if have, say, dozens of columns to perform calculations on, not necessarily summary statistics.
Calculations within the named list
One last thing to mention about the across()
function is that you can use purrr-style
lambdas in your named list. purrr
is useful package also in the
tidyverse that specializes in working with functions and vectors, e.g.,
running the same function on every item in a vector.
The useful part about lambdas is that you can perform mathematical operations within it and also use multiple functions. Standard error calculations are common in biology, so let’s use this as an example.
In a lambda you have to first have a ~ character, before writing out a
formula. Here the contents of the grouped columns then will be passed to wherever
you have .
characters in the formula.
iris %>%
group_by(Species) %>%
summarise(across(.cols = starts_with("Sepal"), # select columns that start with "Sepal".
.fns = list(Mean = mean,
SD = sd,
SE = ~ sd(.)/sqrt(length(.))), # length returns the number of items in the groupings of the column, i.e. per species
.names = "{.col}.{.fn}"))
## # A tibble: 3 × 7
## Species Sepal.Length.Mean Sepal.Length.SD Sepal.L…¹ Sepal…² Sepal…³ Sepal…⁴
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 0.352 0.0498 3.43 0.379 0.0536
## 2 versicolor 5.94 0.516 0.0730 2.77 0.314 0.0444
## 3 virginica 6.59 0.636 0.0899 2.97 0.322 0.0456
## # … with abbreviated variable names ¹Sepal.Length.SE, ²Sepal.Width.Mean,
## # ³Sepal.Width.SD, ⁴Sepal.Width.SE
Keeping the original columns
Finally, you don’t always want just the results of the across()
function. If you want to keep the original columns, you can use mutate()
instead of summarise()
.
iris %>%
group_by(Species) %>%
mutate(across(.cols = starts_with("Sepal"), # select columns that start with "Sepal".
.fns = list(Mean = mean,
SD = sd,
SE = ~ sd(.)/sqrt(length(.))), # length returns the number of items in the groupings of the column, i.e. per species
.names = "{.col}.{.fn}"))
## # A tibble: 150 × 11
## # Groups: Species [3]
## Sepal.Length Sepal.…¹ Petal…² Petal…³ Species Sepal…⁴ Sepal…⁵ Sepal…⁶ Sepal…⁷
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4 0.2 setosa 5.01 0.352 0.0498 3.43
## 2 4.9 3 1.4 0.2 setosa 5.01 0.352 0.0498 3.43
## 3 4.7 3.2 1.3 0.2 setosa 5.01 0.352 0.0498 3.43
## 4 4.6 3.1 1.5 0.2 setosa 5.01 0.352 0.0498 3.43
## 5 5 3.6 1.4 0.2 setosa 5.01 0.352 0.0498 3.43
## 6 5.4 3.9 1.7 0.4 setosa 5.01 0.352 0.0498 3.43
## 7 4.6 3.4 1.4 0.3 setosa 5.01 0.352 0.0498 3.43
## 8 5 3.4 1.5 0.2 setosa 5.01 0.352 0.0498 3.43
## 9 4.4 2.9 1.4 0.2 setosa 5.01 0.352 0.0498 3.43
## 10 4.9 3.1 1.5 0.1 setosa 5.01 0.352 0.0498 3.43
## # … with 140 more rows, 2 more variables: Sepal.Width.SD <dbl>,
## # Sepal.Width.SE <dbl>, and abbreviated variable names ¹Sepal.Width,
## # ²Petal.Length, ³Petal.Width, ⁴Sepal.Length.Mean, ⁵Sepal.Length.SD,
## # ⁶Sepal.Length.SE, ⁷Sepal.Width.Mean