# Summary Statistics with {TidyDensity}

code
weeklytip
tidydensity
datatable
Author

Steven P. Sanderson II, MPH

Published

November 23, 2022

# Introduction

Many times someone may want to see a summary or cumulative statistic for a given set of data or even from several simulations of data. I went over bootstrap plotting earlier this month, and this is a form of what we will go over today although slightly more restrictive.

I have decided to make today my weekly r-tip because tomorrow is Thanksgiving here in the US and I am taking an extended holiday so I won’t be back until Monday.

Today’s function and weekly tip is on `tidy_stat_tbl()`. It is meant to be used with a `tidy_` distribution function. Let’s take a look.

# Function

Here is the function call:

``````tidy_stat_tbl(
.data,
.x = y,
.fns,
.return_type = "vector",
.use_data_table = FALSE,
...
)``````

Here are the arguments to the parameters of the function:

• `.data` - The input data coming from a tidy_ distribution function.
• `.x` - The default is y but can be one of the other columns from the input `data.`
• `.fns` - The default is IQR, but this can be any stat function like quantile or median etc.
• `.return_type` - The default is “vector” which returns an sapply object.
• `.use_data_table` - The default is FALSE, TRUE will use data.table under the hood and still return a tibble. If this argument is set to TRUE then the .return_type parameter will be ignored.
• `...` - Addition function arguments to be supplied to the parameters of `.fns`

# Examples

## Single Simulation

Let’s go over some examples. Firstly, we will go over all the different `.return_type`’s of a single simulation of `tidy_normal()` using the `quantile` function.

Vector Output BE CAREFUL IT USES SAPPLY

``````library(TidyDensity)

set.seed(123)
tn <- tidy_normal()

tidy_stat_tbl(
.data = tn,
.x = y,
.return_type = "vector",
.fns = quantile,
na.rm = TRUE,
probs = c(0.025, 0.5, 0.975)
)``````
``````      sim_number_1
2.5%   -1.59190149
50%    -0.07264039
97.5%   1.77074730``````

List Output with lapply

``````tidy_stat_tbl(
tn, y, quantile, "list", na.rm = TRUE
)``````
``````\$sim_number_1
0%         25%         50%         75%        100%
-1.96661716 -0.55931702 -0.07264039  0.69817699  2.16895597 ``````
``````tidy_stat_tbl(
tn, y, quantile, "list", na.rm = TRUE,
probs = c(0.025, 0.5, 0.975)
)``````
``````\$sim_number_1
2.5%         50%       97.5%
-1.59190149 -0.07264039  1.77074730 ``````

Tibble output with tibble

``````tidy_stat_tbl(
tn, y, quantile, "tibble", na.rm = TRUE
)``````
``````# A tibble: 5 × 3
sim_number name  quantile
<fct>      <chr>    <dbl>
1 1          0%     -1.97
2 1          25%    -0.559
3 1          50%    -0.0726
4 1          75%     0.698
5 1          100%    2.17  ``````
``````tidy_stat_tbl(
tn, y, quantile, "tibble", na.rm = TRUE,
probs = c(0.025, 0.5, 0.975)
)``````
``````# A tibble: 3 × 3
sim_number name  quantile
<fct>      <chr>    <dbl>
1 1          2.5%   -1.59
2 1          50%    -0.0726
3 1          97.5%   1.77  ``````

Tibble output with data.table The output object is a `tibble` but `data.table` is used to perform the calculations which can be magnitudes faster when simulations are large. I will showcase down the post.

``````tidy_stat_tbl(
tn, y, quantile, .use_data_table = TRUE, na.rm = TRUE
)``````
``````# A tibble: 5 × 3
sim_number name  quantile
<fct>      <fct>    <dbl>
1 1          0%     -1.97
2 1          25%    -0.559
3 1          50%    -0.0726
4 1          75%     0.698
5 1          100%    2.17  ``````
``````tidy_stat_tbl(
tn, y, quantile, .use_data_table = TRUE, na.rm = TRUE,
probs = c(0.025, 0.5, 0.975)
)``````
``````# A tibble: 3 × 3
sim_number name  quantile
<fct>      <fct>    <dbl>
1 1          2.5%   -1.59
2 1          50%    -0.0726
3 1          97.5%   1.77  ``````

Now let’s take a look with multiple simulations.

## Multiple Simulations

Let’s set our simulation count to 5. While this is not a large amount it will serve as a good illustration on the outputs.

``````ns <- 5
f  <- quantile
nr <- TRUE
p  <- c(0.025, 0.975)``````

Ok let’s run the same simulations but with the updated params.

Vector Output BE CAREFUL IT USES SAPPLY

``````set.seed(123)
tn <- tidy_normal(.num_sims = ns)

tidy_stat_tbl(
.data = tn,
.x = y,
.return_type = "vector",
.fns = f,
na.rm = nr,
probs = p
)``````
``````      sim_number_1 sim_number_2 sim_number_3 sim_number_4 sim_number_5
2.5%     -1.591901    -1.474945    -1.656679    -1.258156    -1.309749
97.5%     1.770747     1.933653     1.894424     2.098923     1.943384``````
``````tidy_stat_tbl(
tn, y, .return_type = "vector",
.fns = f, na.rm = nr
)``````
``````     sim_number_1 sim_number_2 sim_number_3 sim_number_4 sim_number_5
0%    -1.96661716   -2.3091689   -2.0532472  -1.31080153   -1.3598407
25%   -0.55931702   -0.3612969   -0.9505826  -0.49541417   -0.7140627
50%   -0.07264039    0.1525789   -0.3048700  -0.07675993   -0.2240352
75%    0.69817699    0.6294358    0.2900859   0.55145766    0.5287605
100%   2.16895597    2.1873330    2.1001089   3.24103993    2.1988103``````

List Output with lapply

``````tidy_stat_tbl(
tn, y, f, "list", na.rm = nr
)``````
``````\$sim_number_1
0%         25%         50%         75%        100%
-1.96661716 -0.55931702 -0.07264039  0.69817699  2.16895597

\$sim_number_2
0%        25%        50%        75%       100%
-2.3091689 -0.3612969  0.1525789  0.6294358  2.1873330

\$sim_number_3
0%        25%        50%        75%       100%
-2.0532472 -0.9505826 -0.3048700  0.2900859  2.1001089

\$sim_number_4
0%         25%         50%         75%        100%
-1.31080153 -0.49541417 -0.07675993  0.55145766  3.24103993

\$sim_number_5
0%        25%        50%        75%       100%
-1.3598407 -0.7140627 -0.2240352  0.5287605  2.1988103 ``````
``````tidy_stat_tbl(
tn, y, f, "list", na.rm = nr,
probs = p
)``````
``````\$sim_number_1
2.5%     97.5%
-1.591901  1.770747

\$sim_number_2
2.5%     97.5%
-1.474945  1.933653

\$sim_number_3
2.5%     97.5%
-1.656679  1.894424

\$sim_number_4
2.5%     97.5%
-1.258156  2.098923

\$sim_number_5
2.5%     97.5%
-1.309749  1.943384 ``````

Tibble output with tibble

``````tidy_stat_tbl(
tn, y, f, "tibble", na.rm = nr
)``````
``````# A tibble: 25 × 3
sim_number name        f
<fct>      <chr>   <dbl>
1 1          0%    -1.97
2 1          25%   -0.559
3 1          50%   -0.0726
4 1          75%    0.698
5 1          100%   2.17
6 2          0%    -2.31
7 2          25%   -0.361
8 2          50%    0.153
9 2          75%    0.629
10 2          100%   2.19
# … with 15 more rows``````
``````tidy_stat_tbl(
tn, y, f, "tibble", na.rm = nr,
probs = p
)``````
``````# A tibble: 10 × 3
sim_number name      f
<fct>      <chr> <dbl>
1 1          2.5%  -1.59
2 1          97.5%  1.77
3 2          2.5%  -1.47
4 2          97.5%  1.93
5 3          2.5%  -1.66
6 3          97.5%  1.89
7 4          2.5%  -1.26
8 4          97.5%  2.10
9 5          2.5%  -1.31
10 5          97.5%  1.94``````

Tibble output with data.table The output object is a `tibble` but `data.table` is used to perform the calculations which can be magnitudes faster when simulations are large. I will showcase down the post.

``````tidy_stat_tbl(
tn, y, f, .use_data_table = TRUE, na.rm = nr
)``````
``````# A tibble: 25 × 3
sim_number name        f
<fct>      <fct>   <dbl>
1 1          0%    -1.97
2 1          25%   -0.559
3 1          50%   -0.0726
4 1          75%    0.698
5 1          100%   2.17
6 2          0%    -2.31
7 2          25%   -0.361
8 2          50%    0.153
9 2          75%    0.629
10 2          100%   2.19
# … with 15 more rows``````
``````tidy_stat_tbl(
tn, y, f, .use_data_table = TRUE, na.rm = nr,
probs = p
)``````
``````# A tibble: 10 × 3
sim_number name      f
<fct>      <fct> <dbl>
1 1          2.5%  -1.59
2 1          97.5%  1.77
3 2          2.5%  -1.47
4 2          97.5%  1.93
5 3          2.5%  -1.66
6 3          97.5%  1.89
7 4          2.5%  -1.26
8 4          97.5%  2.10
9 5          2.5%  -1.31
10 5          97.5%  1.94``````

Ok, now that we have shown that, let’s ratchet up the simulations so we can see the true difference in using the `.use_data_tbl` parameter when simulations are large. We are going to use `{rbenchmark}` for

## Benchmarking

Here we go. We are going to make a `tidy_bootstrap()` of the `mtcars\$mpg` data which will produce 2000 simulations, we will replicate this 25 times.

``````library(rbenchmark)
library(TidyDensity)
library(dplyr)

# Get the interesting vector, well for this anyways
x <- mtcars\$mpg

# Bootstrap the vector (2k simulations is default)
tb <- tidy_bootstrap(x) %>%
bootstrap_unnest_tbl()

benchmark(
"tibble" = {
tidy_stat_tbl(tb, y, IQR, "tibble")
},
"data.table" = {
tidy_stat_tbl(tb, y, IQR, .use_data_table = TRUE, type = 7)
},
"sapply" = {
tidy_stat_tbl(tb, y, IQR, "vector")
},
"lapply" = {
tidy_stat_tbl(tb, y, IQR, "list")
},
replications = 25,
columns = c("test","replications","elapsed","relative","user.self","sys.self"  )
) %>%
arrange(relative)``````
``````        test replications elapsed relative user.self sys.self
1 data.table           25    4.11    1.000      3.33     0.11
2     lapply           25   24.14    5.873     20.02     0.38
3     sapply           25   25.11    6.109     21.01     0.28
4     tibble           25   33.18    8.073     27.45     0.51``````

Voila!