Introduction
Many times someone may want to see a summary or cumulative statistic for a given set of data, or even for several simulations of data. I went over bootstrap plotting earlier this month; that was a slightly more restrictive form of what we will go over today.
I have decided to make today my weekly r-tip because tomorrow is Thanksgiving here in the US and I am taking an extended holiday, so I won’t be back until Monday.
Today’s function and weekly tip is tidy_stat_tbl(). It is meant to be used with a tidy_ distribution function. Let’s take a look.
Function
Here is the function call:
tidy_stat_tbl(
  .data,
  .x = y,
  .fns,
  .return_type = "vector",
  .use_data_table = FALSE,
  ...
)
Here are the arguments to the function:
.data - The input data coming from a tidy_ distribution function.
.x - The default is y, but it can be any of the other columns from the input data.
.fns - The default is IQR, but this can be any stat function, such as quantile or median.
.return_type - The default is “vector”, which returns an sapply object.
.use_data_table - The default is FALSE. TRUE will use data.table under the hood and still return a tibble. If this argument is set to TRUE, the .return_type parameter is ignored.
... - Additional arguments to be passed on to .fns.
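The way extra arguments travel through ... to .fns can be illustrated with a small base R sketch. The helper apply_stat() below is hypothetical and is not part of TidyDensity; it only shows the general forwarding pattern, on the assumption that tidy_stat_tbl() passes its extra arguments along in a similar way.

```r
# Hypothetical helper (not part of TidyDensity): forwards extra arguments to fn
apply_stat <- function(x, fn, ...) {
  fn(x, ...)
}

x <- c(1, 2, 3, 4, NA)

# Without extra arguments, median() returns NA because of the missing value
apply_stat(x, median)

# na.rm = TRUE travels through ... to median()
apply_stat(x, median, na.rm = TRUE)

# probs and na.rm both travel through ... to quantile()
apply_stat(x, quantile, probs = c(0.25, 0.75), na.rm = TRUE)
```

This is why the examples in this post can pass na.rm and probs straight into the call.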
Examples
Single Simulation
Let’s go over some examples. First, we will go over each of the .return_type options for a single simulation of tidy_normal(), using the quantile function.
Vector Output (be careful, it uses sapply)
library(TidyDensity)

set.seed(123)
tn <- tidy_normal()

tidy_stat_tbl(
  .data = tn,
  .x = y,
  .return_type = "vector",
  .fns = quantile,
  na.rm = TRUE,
  probs = c(0.025, 0.5, 0.975)
)

      sim_number_1
2.5%   -1.59190149
50%    -0.07264039
97.5%   1.77074730
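The caution about sapply() comes from the way it simplifies results: the shape of the returned object depends on what the function returns for each group. A minimal base R sketch, using made-up data in place of tidy_ output, shows the three cases.

```r
# Made-up data standing in for two simulations; not produced by TidyDensity
set.seed(123)
sims <- list(sim_number_1 = rnorm(50), sim_number_2 = rnorm(50))

# One value per simulation: sapply() simplifies to a named numeric vector
iqr_out <- sapply(sims, IQR)
is.vector(iqr_out)   # TRUE

# Several values per simulation: sapply() simplifies to a matrix,
# one column per simulation and one row per quantile
q_out <- sapply(sims, quantile, probs = c(0.025, 0.5, 0.975))
dim(q_out)           # 3 2

# Results of different lengths cannot be simplified, so a list comes back
uneven <- sapply(list(a = 1:2, b = 1:3), rev)
is.list(uneven)      # TRUE
```

If the downstream code needs a predictable shape, the "list" or "tibble" return types are the safer choice.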
List Output with lapply
tidy_stat_tbl(
  tn, y, quantile, "list", na.rm = TRUE
)

$sim_number_1
         0%         25%         50%         75%        100%
-1.96661716 -0.55931702 -0.07264039  0.69817699  2.16895597

tidy_stat_tbl(
  tn, y, quantile, "list", na.rm = TRUE,
  probs = c(0.025, 0.5, 0.975)
)

$sim_number_1
       2.5%         50%       97.5%
-1.59190149 -0.07264039  1.77074730
Tibble output with tibble
tidy_stat_tbl(
  tn, y, quantile, "tibble", na.rm = TRUE
)

# A tibble: 5 × 3
  sim_number name  quantile
  <fct>      <chr>    <dbl>
1 1          0%     -1.97
2 1          25%    -0.559
3 1          50%    -0.0726
4 1          75%     0.698
5 1          100%    2.17

tidy_stat_tbl(
  tn, y, quantile, "tibble", na.rm = TRUE,
  probs = c(0.025, 0.5, 0.975)
)

# A tibble: 3 × 3
  sim_number name  quantile
  <fct>      <chr>    <dbl>
1 1          2.5%   -1.59
2 1          50%    -0.0726
3 1          97.5%   1.77
Tibble output with data.table
The output object is a tibble, but data.table is used to perform the calculations, which can be many times faster when the number of simulations is large. I will show this later in the post.
tidy_stat_tbl(
  tn, y, quantile, .use_data_table = TRUE, na.rm = TRUE
)

# A tibble: 5 × 3
  sim_number name  quantile
  <fct>      <fct>    <dbl>
1 1          0%     -1.97
2 1          25%    -0.559
3 1          50%    -0.0726
4 1          75%     0.698
5 1          100%    2.17

tidy_stat_tbl(
  tn, y, quantile, .use_data_table = TRUE, na.rm = TRUE,
  probs = c(0.025, 0.5, 0.975)
)

# A tibble: 3 × 3
  sim_number name  quantile
  <fct>      <fct>    <dbl>
1 1          2.5%   -1.59
2 1          50%    -0.0726
3 1          97.5%   1.77
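To give a feel for what "data.table under the hood" can mean, here is a rough sketch of a grouped quantile calculation done with data.table and handed back as a tibble. This is not the TidyDensity source code; the data, the column name value, and the overall approach are assumptions made purely for illustration.

```r
library(data.table)

# Made-up long-format data: 3 simulations of 50 draws each,
# shaped like the sim_number / y columns used in this post
set.seed(123)
dt <- data.table(
  sim_number = factor(rep(1:3, each = 50)),
  y = rnorm(150)
)

probs <- c(0.025, 0.5, 0.975)

# Grouped calculation: one set of quantiles per simulation
res <- dt[, {
  q <- quantile(y, probs = probs)
  list(name = names(q), value = unname(q))
}, by = sim_number]

# Hand the result back as a tibble
tibble::as_tibble(res)
```

The grouping happens in a single data.table call rather than in a per-simulation loop, which is the kind of work data.table is optimized for.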
Now let’s take a look with multiple simulations.
Multiple Simulations
Let’s set our simulation count to 5. While this is not a large number, it will serve as a good illustration of the outputs.
ns <- 5
f <- quantile
nr <- TRUE
p <- c(0.025, 0.975)
Ok, let’s run the same examples, but with the updated parameters.
Vector Output (be careful, it uses sapply)
set.seed(123)
tn <- tidy_normal(.num_sims = ns)

tidy_stat_tbl(
  .data = tn,
  .x = y,
  .return_type = "vector",
  .fns = f,
  na.rm = nr,
  probs = p
)

      sim_number_1 sim_number_2 sim_number_3 sim_number_4 sim_number_5
2.5%     -1.591901    -1.474945    -1.656679    -1.258156    -1.309749
97.5%     1.770747     1.933653     1.894424     2.098923     1.943384

tidy_stat_tbl(
  tn, y, .return_type = "vector",
  .fns = f, na.rm = nr
)

     sim_number_1 sim_number_2 sim_number_3 sim_number_4 sim_number_5
0%    -1.96661716   -2.3091689   -2.0532472  -1.31080153   -1.3598407
25%   -0.55931702   -0.3612969   -0.9505826  -0.49541417   -0.7140627
50%   -0.07264039    0.1525789   -0.3048700  -0.07675993   -0.2240352
75%    0.69817699    0.6294358    0.2900859   0.55145766    0.5287605
100%   2.16895597    2.1873330    2.1001089   3.24103993    2.1988103
List Output with lapply
tidy_stat_tbl(
  tn, y, f, "list", na.rm = nr
)

$sim_number_1
         0%         25%         50%         75%        100%
-1.96661716 -0.55931702 -0.07264039  0.69817699  2.16895597

$sim_number_2
        0%        25%        50%        75%       100%
-2.3091689 -0.3612969  0.1525789  0.6294358  2.1873330

$sim_number_3
        0%        25%        50%        75%       100%
-2.0532472 -0.9505826 -0.3048700  0.2900859  2.1001089

$sim_number_4
         0%         25%         50%         75%        100%
-1.31080153 -0.49541417 -0.07675993  0.55145766  3.24103993

$sim_number_5
        0%        25%        50%        75%       100%
-1.3598407 -0.7140627 -0.2240352  0.5287605  2.1988103

tidy_stat_tbl(
  tn, y, f, "list", na.rm = nr,
  probs = p
)

$sim_number_1
     2.5%     97.5%
-1.591901  1.770747

$sim_number_2
     2.5%     97.5%
-1.474945  1.933653

$sim_number_3
     2.5%     97.5%
-1.656679  1.894424

$sim_number_4
     2.5%     97.5%
-1.258156  2.098923

$sim_number_5
     2.5%     97.5%
-1.309749  1.943384
Tibble output with tibble
tidy_stat_tbl(
  tn, y, f, "tibble", na.rm = nr
)

# A tibble: 25 × 3
   sim_number name        f
   <fct>      <chr>   <dbl>
 1 1          0%    -1.97
 2 1          25%   -0.559
 3 1          50%   -0.0726
 4 1          75%    0.698
 5 1          100%   2.17
 6 2          0%    -2.31
 7 2          25%   -0.361
 8 2          50%    0.153
 9 2          75%    0.629
10 2          100%   2.19
# … with 15 more rows

tidy_stat_tbl(
  tn, y, f, "tibble", na.rm = nr,
  probs = p
)

# A tibble: 10 × 3
   sim_number name      f
   <fct>      <chr> <dbl>
 1 1          2.5%  -1.59
 2 1          97.5%  1.77
 3 2          2.5%  -1.47
 4 2          97.5%  1.93
 5 3          2.5%  -1.66
 6 3          97.5%  1.89
 7 4          2.5%  -1.26
 8 4          97.5%  2.10
 9 5          2.5%  -1.31
10 5          97.5%  1.94
Tibble output with data.table
As before, the output object is a tibble, but data.table performs the calculations.
tidy_stat_tbl(
  tn, y, f, .use_data_table = TRUE, na.rm = nr
)

# A tibble: 25 × 3
   sim_number name        f
   <fct>      <fct>   <dbl>
 1 1          0%    -1.97
 2 1          25%   -0.559
 3 1          50%   -0.0726
 4 1          75%    0.698
 5 1          100%   2.17
 6 2          0%    -2.31
 7 2          25%   -0.361
 8 2          50%    0.153
 9 2          75%    0.629
10 2          100%   2.19
# … with 15 more rows

tidy_stat_tbl(
  tn, y, f, .use_data_table = TRUE, na.rm = nr,
  probs = p
)

# A tibble: 10 × 3
   sim_number name      f
   <fct>      <fct> <dbl>
 1 1          2.5%  -1.59
 2 1          97.5%  1.77
 3 2          2.5%  -1.47
 4 2          97.5%  1.93
 5 3          2.5%  -1.66
 6 3          97.5%  1.89
 7 4          2.5%  -1.26
 8 4          97.5%  2.10
 9 5          2.5%  -1.31
10 5          97.5%  1.94
Ok, now that we have shown that, let’s ratchet up the simulations so we can see the true difference in using the .use_data_table parameter when simulations are large. We are going to use {rbenchmark} for this.
Benchmarking
Here we go. We are going to make a tidy_bootstrap() of the mtcars$mpg data, which will produce 2000 simulations, and we will replicate each test 25 times.
library(rbenchmark)
library(TidyDensity)
library(dplyr)

# Get the interesting vector, well for this anyways
x <- mtcars$mpg

# Bootstrap the vector (2k simulations is default)
tb <- tidy_bootstrap(x) %>%
  bootstrap_unnest_tbl()

benchmark(
  "tibble" = {
    tidy_stat_tbl(tb, y, IQR, "tibble")
  },
  "data.table" = {
    tidy_stat_tbl(tb, y, IQR, .use_data_table = TRUE, type = 7)
  },
  "sapply" = {
    tidy_stat_tbl(tb, y, IQR, "vector")
  },
  "lapply" = {
    tidy_stat_tbl(tb, y, IQR, "list")
  },
  replications = 25,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
) %>%
  arrange(relative)

        test replications elapsed relative user.self sys.self
1 data.table           25    4.11    1.000      3.33     0.11
2     lapply           25   24.14    5.873     20.02     0.38
3     sapply           25   25.11    6.109     21.01     0.28
4     tibble           25   33.18    8.073     27.45     0.51
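As I understand the rbenchmark output, the relative column is each test’s elapsed time divided by the fastest elapsed time. That arithmetic can be checked with a few lines of base R, using the elapsed values copied from the table above.

```r
# Elapsed times (seconds) copied from the benchmark output above
elapsed <- c(data.table = 4.11, lapply = 24.14, sapply = 25.11, tibble = 33.18)

# Relative time: each elapsed value divided by the fastest one
relative <- round(elapsed / min(elapsed), 3)
relative
```

The values match the relative column: on this run the tibble path took roughly 8 times as long as the data.table path, and the sapply and lapply paths roughly 6 times as long.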
Voila!