Distribution Summaries with {TidyDensity}

Author

Steven P. Sanderson II, MPH

Published

December 14, 2022

Introduction

{TidyDensity} is an R package that provides tools for working with probability distributions in a tidy data format. One of the key functions in the package is tidy_distribution_summary_tbl(), which allows users to quickly and easily get summary information about a probability distribution.

The tidy_distribution_summary_tbl() function takes the tibble returned by a tidy_ distribution function (for example tidy_normal()) and returns a table of summary statistics for the simulated values, with one row per group. The statistics are the mean, median, standard deviation, minimum, maximum, skewness, kurtosis, range, interquartile range, and variance, plus the ci_low and ci_high bounds.
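To make those statistics concrete, here is a minimal base R sketch that computes the same kind of summary for a single numeric vector. summarise_vec() is a hypothetical helper, not part of {TidyDensity}, and the moment-based skewness and kurtosis formulas and the 2.5% and 97.5% quantile bounds are assumptions about how such values are commonly defined, not a statement of the package's internals.

```r
# Hypothetical helper (not part of TidyDensity): summarise one numeric vector
summarise_vec <- function(x) {
  m <- mean(x)
  dev <- x - m
  n <- length(x)
  data.frame(
    mean_val   = m,
    median_val = median(x),
    std_val    = sd(x),
    min_val    = min(x),
    max_val    = max(x),
    # Assumed moment-based definitions; a normal sample gives kurtosis near 3
    skewness   = (sum(dev^3) / n) / (sum(dev^2) / n)^1.5,
    kurtosis   = (sum(dev^4) / n) / (sum(dev^2) / n)^2,
    range      = diff(range(x)),
    iqr        = IQR(x),
    variance   = var(x),
    # Assumed: 2.5% and 97.5% sample quantiles
    ci_low     = unname(quantile(x, 0.025)),
    ci_high    = unname(quantile(x, 0.975))
  )
}

set.seed(123)
summarise_vec(rnorm(50))
```

The column names mirror the ones the function returns so the two outputs are easy to compare side by side.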

Using tidy_distribution_summary_tbl(), users can easily get a high-level overview of their data, which can be useful for exploratory data analysis, data visualization, and other tasks. The function is designed to work seamlessly with the other tools in the {TidyDensity} package, making it easy to combine with other operations and build complex data analysis pipelines.

Overall, {TidyDensity} and its tidy_distribution_summary_tbl() function give anyone working with probability distributions in R, from seasoned data scientist to beginner, a quick way to summarize and understand simulated data.

Function

Let’s take a look at the full function call.

tidy_distribution_summary_tbl(.data, ...)

Here are the arguments.

  • .data - The data passed in from a tidy_ distribution function.
  • ... - Optional grouping variable(s), passed to dplyr::group_by() and dplyr::select(). If omitted, all rows are summarized as a single group.
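As a rough illustration of what the grouping argument does, the sketch below builds two of the per-simulation statistics by hand with {dplyr} and then requests the same grouping through `...`. It assumes the simulated values live in a column named y, and manual_tbl and auto_tbl are names chosen here for illustration.

```r
library(TidyDensity)
library(dplyr)

tn <- tidy_normal(.num_sims = 5)

# Manual per-simulation summary; assumes the simulated values are in column `y`
manual_tbl <- tn |>
  group_by(sim_number) |>
  summarise(mean_val = mean(y), std_val = sd(y))

# The same grouping expressed through `...`
auto_tbl <- tidy_distribution_summary_tbl(tn, sim_number)

manual_tbl
auto_tbl |>
  select(sim_number, mean_val, std_val)
```

Both tables should have one row per simulation, which is a quick way to confirm that the grouping variable was picked up.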

Example

Now let’s go over a simple example.

library(TidyDensity)
library(dplyr)

tn <- tidy_normal(.num_sims = 5)
tb <- tidy_beta(.num_sims = 5)

tidy_distribution_summary_tbl(tn) |>
  glimpse()

Rows: 1
Columns: 12
$ mean_val   <dbl> -0.044964
$ median_val <dbl> -0.0266966
$ std_val    <dbl> 1.020322
$ min_val    <dbl> -2.834123
$ max_val    <dbl> 3.336879
$ skewness   <dbl> 0.03115634
$ kurtosis   <dbl> 2.772527
$ range      <dbl> 6.171002
$ iqr        <dbl> 1.447849
$ variance   <dbl> 1.041057
$ ci_low     <dbl> -1.873091
$ ci_high    <dbl> 1.868382

tidy_distribution_summary_tbl(tn, sim_number) |>
  glimpse()

Rows: 5
Columns: 13
$ sim_number <fct> 1, 2, 3, 4, 5
$ mean_val   <dbl> -0.09684833, -0.13886169, 0.23257556, -0.32487778, 0.103192…
$ median_val <dbl> -0.1358051, -0.2550682, 0.3069263, -0.1334922, 0.2898412
$ std_val    <dbl> 1.1231699, 1.0954659, 0.8902380, 0.9270631, 0.9919932
$ min_val    <dbl> -2.834123, -2.340575, -1.963215, -2.396105, -1.827744
$ max_val    <dbl> 3.336879, 1.987640, 2.066451, 1.526231, 2.093211
$ skewness   <dbl> 0.352771389, 0.132723834, -0.282840344, -0.191853538, 0.006…
$ kurtosis   <dbl> 3.652828, 2.169309, 2.749967, 2.332081, 2.409223
$ range      <dbl> 6.171002, 4.328215, 4.029666, 3.922336, 3.920956
$ iqr        <dbl> 1.5256470, 1.6335396, 0.9368546, 1.3968485, 1.3469671
$ variance   <dbl> 1.2615106, 1.2000455, 0.7925236, 0.8594460, 0.9840505
$ ci_low     <dbl> -1.834548, -1.844197, -1.428713, -2.193065, -1.626225
$ ci_high    <dbl> 1.860755, 1.858576, 1.644153, 1.090125, 1.976371

data_tbl <- tidy_combine_distributions(tn, tb)

tidy_distribution_summary_tbl(data_tbl) |>
  glimpse()

Rows: 1
Columns: 12
$ mean_val   <dbl> 0.2413251
$ median_val <dbl> 0.3687409
$ std_val    <dbl> 0.8030476
$ min_val    <dbl> -2.834123
$ max_val    <dbl> 3.336879
$ skewness   <dbl> -0.7608556
$ kurtosis   <dbl> 4.248452
$ range      <dbl> 6.171002
$ iqr        <dbl> 0.7835065
$ variance   <dbl> 0.6448855
$ ci_low     <dbl> -1.695096
$ ci_high    <dbl> 1.585147

tidy_distribution_summary_tbl(data_tbl, dist_type) |>
  glimpse()

Rows: 2
Columns: 13
$ dist_type  <fct> "Gaussian c(0, 1)", "Beta c(1, 1, 0)"
$ mean_val   <dbl> -0.0449640, 0.5276142
$ median_val <dbl> -0.0266966, 0.5301650
$ std_val    <dbl> 1.0203220, 0.2944871
$ min_val    <dbl> -2.834123047, 0.001236575
$ max_val    <dbl> 3.3368786, 0.9992146
$ skewness   <dbl> 0.03115634, -0.08744219
$ kurtosis   <dbl> 2.772527, 1.751248
$ range      <dbl> 6.171002, 0.997978
$ iqr        <dbl> 1.447849, 0.511105
$ variance   <dbl> 1.04105699, 0.08672268
$ ci_low     <dbl> -1.87309115, 0.04220623
$ ci_high    <dbl> 1.8683817, 0.9771898
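One way to sanity-check a summary like this is to compare it with the theoretical moments of the distribution. The sketch below does that for the Beta row, reading the label "Beta c(1, 1, 0)" as shape1 = 1, shape2 = 1 and ncp = 0, which is an assumption about the label format. The sample values are copied from the grouped output above.

```r
# Assumed reading of the label "Beta c(1, 1, 0)": shape1 = 1, shape2 = 1, ncp = 0
a <- 1
b <- 1

# Theoretical mean and variance of a Beta(a, b) distribution
theory_mean <- a / (a + b)
theory_var  <- (a * b) / ((a + b)^2 * (a + b + 1))

# Sample values copied from the grouped output above
sample_mean <- 0.5276142
sample_var  <- 0.08672268

c(theory_mean = theory_mean, sample_mean = sample_mean)
c(theory_var = theory_var, sample_var = sample_var)
```

With these shape values the Beta distribution is uniform on (0, 1), so a theoretical mean of 0.5 and variance of 1/12 are the benchmarks; the simulated values land close to both.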