Summary Statistics with {TidyDensity}

code
weeklytip
tidydensity
datatable
Author

Steven P. Sanderson II, MPH

Published

November 23, 2022

Introduction

Many times someone may want to see a summary or cumulative statistic for a given set of data or even from several simulations of data. I went over bootstrap plotting earlier this month, and this is a form of what we will go over today although slightly more restrictive.

I have decided to make today my weekly r-tip because tomorrow is Thanksgiving here in the US and I am taking an extended holiday so I won’t be back until Monday.

Today’s function and weekly tip is on tidy_stat_tbl(). It is meant to be used with a tidy_ distribution function. Let’s take a look.

Function

Here is the function call:

tidy_stat_tbl(
  .data,
  .x = y,
  .fns,
  .return_type = "vector",
  .use_data_table = FALSE,
  ...
)

Here are the arguments to the parameters of the function:

  • .data - The input data coming from a tidy_ distribution function.
  • .x - The default is y but can be one of the other columns from the input data.
  • .fns - The default is IQR, but this can be any stat function like quantile or median etc.
  • .return_type - The default is “vector” which returns an sapply object.
  • .use_data_table - The default is FALSE, TRUE will use data.table under the hood and still return a tibble. If this argument is set to TRUE then the .return_type parameter will be ignored.
  • ... - Addition function arguments to be supplied to the parameters of .fns

Examples

Single Simulation

Let’s go over some examples. Firstly, we will go over all the different .return_type’s of a single simulation of tidy_normal() using the quantile function.

Vector Output BE CAREFUL IT USES SAPPLY

library(TidyDensity)

set.seed(123)
tn <- tidy_normal()

tidy_stat_tbl(
  .data = tn,
  .x = y,
  .return_type = "vector",
  .fns = quantile,
  na.rm = TRUE,
  probs = c(0.025, 0.5, 0.975)
  )
      sim_number_1
2.5%   -1.59190149
50%    -0.07264039
97.5%   1.77074730

List Output with lapply

tidy_stat_tbl(
  tn, y, quantile, "list", na.rm = TRUE
)
$sim_number_1
         0%         25%         50%         75%        100% 
-1.96661716 -0.55931702 -0.07264039  0.69817699  2.16895597 
tidy_stat_tbl(
  tn, y, quantile, "list", na.rm = TRUE, 
  probs = c(0.025, 0.5, 0.975)
)
$sim_number_1
       2.5%         50%       97.5% 
-1.59190149 -0.07264039  1.77074730 

Tibble output with tibble

tidy_stat_tbl(
  tn, y, quantile, "tibble", na.rm = TRUE
)
# A tibble: 5 × 3
  sim_number name  quantile
  <fct>      <chr>    <dbl>
1 1          0%     -1.97  
2 1          25%    -0.559 
3 1          50%    -0.0726
4 1          75%     0.698 
5 1          100%    2.17  
tidy_stat_tbl(
  tn, y, quantile, "tibble", na.rm = TRUE, 
  probs = c(0.025, 0.5, 0.975)
)
# A tibble: 3 × 3
  sim_number name  quantile
  <fct>      <chr>    <dbl>
1 1          2.5%   -1.59  
2 1          50%    -0.0726
3 1          97.5%   1.77  

Tibble output with data.table The output object is a tibble but data.table is used to perform the calculations which can be magnitudes faster when simulations are large. I will showcase down the post.

tidy_stat_tbl(
  tn, y, quantile, .use_data_table = TRUE, na.rm = TRUE
)
# A tibble: 5 × 3
  sim_number name  quantile
  <fct>      <fct>    <dbl>
1 1          0%     -1.97  
2 1          25%    -0.559 
3 1          50%    -0.0726
4 1          75%     0.698 
5 1          100%    2.17  
tidy_stat_tbl(
  tn, y, quantile, .use_data_table = TRUE, na.rm = TRUE, 
  probs = c(0.025, 0.5, 0.975)
)
# A tibble: 3 × 3
  sim_number name  quantile
  <fct>      <fct>    <dbl>
1 1          2.5%   -1.59  
2 1          50%    -0.0726
3 1          97.5%   1.77  

Now let’s take a look with multiple simulations.

Multiple Simulations

Let’s set our simulation count to 5. While this is not a large amount it will serve as a good illustration on the outputs.

ns <- 5
f  <- quantile
nr <- TRUE
p  <- c(0.025, 0.975)

Ok let’s run the same simulations but with the updated params.

Vector Output BE CAREFUL IT USES SAPPLY

set.seed(123)
tn <- tidy_normal(.num_sims = ns)

tidy_stat_tbl(
  .data = tn,
  .x = y,
  .return_type = "vector",
  .fns = f,
  na.rm = nr,
  probs = p
  )
      sim_number_1 sim_number_2 sim_number_3 sim_number_4 sim_number_5
2.5%     -1.591901    -1.474945    -1.656679    -1.258156    -1.309749
97.5%     1.770747     1.933653     1.894424     2.098923     1.943384
tidy_stat_tbl(
  tn, y, .return_type = "vector",
  .fns = f, na.rm = nr
)
     sim_number_1 sim_number_2 sim_number_3 sim_number_4 sim_number_5
0%    -1.96661716   -2.3091689   -2.0532472  -1.31080153   -1.3598407
25%   -0.55931702   -0.3612969   -0.9505826  -0.49541417   -0.7140627
50%   -0.07264039    0.1525789   -0.3048700  -0.07675993   -0.2240352
75%    0.69817699    0.6294358    0.2900859   0.55145766    0.5287605
100%   2.16895597    2.1873330    2.1001089   3.24103993    2.1988103

List Output with lapply

tidy_stat_tbl(
  tn, y, f, "list", na.rm = nr
)
$sim_number_1
         0%         25%         50%         75%        100% 
-1.96661716 -0.55931702 -0.07264039  0.69817699  2.16895597 

$sim_number_2
        0%        25%        50%        75%       100% 
-2.3091689 -0.3612969  0.1525789  0.6294358  2.1873330 

$sim_number_3
        0%        25%        50%        75%       100% 
-2.0532472 -0.9505826 -0.3048700  0.2900859  2.1001089 

$sim_number_4
         0%         25%         50%         75%        100% 
-1.31080153 -0.49541417 -0.07675993  0.55145766  3.24103993 

$sim_number_5
        0%        25%        50%        75%       100% 
-1.3598407 -0.7140627 -0.2240352  0.5287605  2.1988103 
tidy_stat_tbl(
  tn, y, f, "list", na.rm = nr, 
  probs = p
)
$sim_number_1
     2.5%     97.5% 
-1.591901  1.770747 

$sim_number_2
     2.5%     97.5% 
-1.474945  1.933653 

$sim_number_3
     2.5%     97.5% 
-1.656679  1.894424 

$sim_number_4
     2.5%     97.5% 
-1.258156  2.098923 

$sim_number_5
     2.5%     97.5% 
-1.309749  1.943384 

Tibble output with tibble

tidy_stat_tbl(
  tn, y, f, "tibble", na.rm = nr
)
# A tibble: 25 × 3
   sim_number name        f
   <fct>      <chr>   <dbl>
 1 1          0%    -1.97  
 2 1          25%   -0.559 
 3 1          50%   -0.0726
 4 1          75%    0.698 
 5 1          100%   2.17  
 6 2          0%    -2.31  
 7 2          25%   -0.361 
 8 2          50%    0.153 
 9 2          75%    0.629 
10 2          100%   2.19  
# … with 15 more rows
tidy_stat_tbl(
  tn, y, f, "tibble", na.rm = nr, 
  probs = p
)
# A tibble: 10 × 3
   sim_number name      f
   <fct>      <chr> <dbl>
 1 1          2.5%  -1.59
 2 1          97.5%  1.77
 3 2          2.5%  -1.47
 4 2          97.5%  1.93
 5 3          2.5%  -1.66
 6 3          97.5%  1.89
 7 4          2.5%  -1.26
 8 4          97.5%  2.10
 9 5          2.5%  -1.31
10 5          97.5%  1.94

Tibble output with data.table The output object is a tibble but data.table is used to perform the calculations which can be magnitudes faster when simulations are large. I will showcase down the post.

tidy_stat_tbl(
  tn, y, f, .use_data_table = TRUE, na.rm = nr
)
# A tibble: 25 × 3
   sim_number name        f
   <fct>      <fct>   <dbl>
 1 1          0%    -1.97  
 2 1          25%   -0.559 
 3 1          50%   -0.0726
 4 1          75%    0.698 
 5 1          100%   2.17  
 6 2          0%    -2.31  
 7 2          25%   -0.361 
 8 2          50%    0.153 
 9 2          75%    0.629 
10 2          100%   2.19  
# … with 15 more rows
tidy_stat_tbl(
  tn, y, f, .use_data_table = TRUE, na.rm = nr, 
  probs = p
)
# A tibble: 10 × 3
   sim_number name      f
   <fct>      <fct> <dbl>
 1 1          2.5%  -1.59
 2 1          97.5%  1.77
 3 2          2.5%  -1.47
 4 2          97.5%  1.93
 5 3          2.5%  -1.66
 6 3          97.5%  1.89
 7 4          2.5%  -1.26
 8 4          97.5%  2.10
 9 5          2.5%  -1.31
10 5          97.5%  1.94

Ok, now that we have shown that, let’s ratchet up the simulations so we can see the true difference in using the .use_data_tbl parameter when simulations are large. We are going to use {rbenchmark} for

Benchmarking

Here we go. We are going to make a tidy_bootstrap() of the mtcars$mpg data which will produce 2000 simulations, we will replicate this 25 times.

library(rbenchmark)
library(TidyDensity)
library(dplyr)

# Get the interesting vector, well for this anyways
x <- mtcars$mpg

# Bootstrap the vector (2k simulations is default)
tb <- tidy_bootstrap(x) %>%
  bootstrap_unnest_tbl()

benchmark(
  "tibble" = {
    tidy_stat_tbl(tb, y, IQR, "tibble")
  },
  "data.table" = {
    tidy_stat_tbl(tb, y, IQR, .use_data_table = TRUE, type = 7)
  },
  "sapply" = {
    tidy_stat_tbl(tb, y, IQR, "vector")
  },
  "lapply" = {
    tidy_stat_tbl(tb, y, IQR, "list")
  },
  replications = 25,
  columns = c("test","replications","elapsed","relative","user.self","sys.self"  )
) %>%
  arrange(relative)
        test replications elapsed relative user.self sys.self
1 data.table           25    4.11    1.000      3.33     0.11
2     lapply           25   24.14    5.873     20.02     0.38
3     sapply           25   25.11    6.109     21.01     0.28
4     tibble           25   33.18    8.073     27.45     0.51

Voila!