Bootstrap Analysis

set.seed(123)
library(TidyDensity)
library(dplyr)
library(ggplot2)
library(patchwork)

Bootstrap resampling estimates the sampling variability of a statistic by repeatedly resampling the observed data. TidyDensity provides `tidy_bootstrap()` together with helper functions for unnesting, augmenting, and plotting the resamples.

What is Bootstrap?

Concept

Bootstrap resampling is a non-parametric method for:

Estimating sampling distributions
Calculating confidence intervals
Assessing parameter uncertainty
Making inferences without strong parametric assumptions about the underlying distribution

How It Works

Resample your data with replacement
Calculate statistic of interest for each resample
Repeat process many times (typically 1000-10000)
Analyze distribution of bootstrap statistics
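
The four steps above can be sketched in base R without any package. This is a minimal illustration, not TidyDensity's implementation; `n_boot` and `boot_means` are names chosen here for the example, and the resamples are full size.

```r
set.seed(123)

x <- mtcars$mpg   # observed sample
n_boot <- 2000    # number of resamples (illustrative choice)

# Steps 1-3: resample with replacement, compute the statistic, repeat
boot_means <- replicate(
  n_boot,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Step 4: analyse the distribution of the bootstrap statistic
boot_quantiles <- quantile(boot_means, c(0.025, 0.5, 0.975))
boot_quantiles
```

Each element of `boot_means` is the mean of one resample, so the quantiles describe how much the sample mean would vary from sample to sample.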

When to Use Bootstrap

Unknown or complex sampling distributions
Small to medium sample sizes
When parametric assumptions are questionable
For robust confidence intervals
Complex statistics (median, trimmed mean, etc.)

Bootstrap in TidyDensity

Main Function: `tidy_bootstrap()`

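Generate bootstrap samples in tidy format. Note that with the default `.proportion = 0.8`, each resample holds 80% of the observations rather than the full sample size: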

tidy_bootstrap(
  .x,                    # Your data vector
  .num_sims = 2000,      # Number of bootstrap samples
  .proportion = 0.8,     # Proportion to sample (default = 0.8)
  .distribution_type = "continuous"  # continuous or discrete
)

Return Value

Returns a tidy tibble with:

sim_number - Bootstrap sample identifier
bootstrap_samples - List column of bootstrap samples of .x

Basic Bootstrap Analysis

Simple Bootstrap Example

# Your data
data <- mtcars$mpg

# Perform bootstrap
bootstrap_data <- tidy_bootstrap(
  .x = data,
  .num_sims = 2000
)

# View structure
head(bootstrap_data)
#> # A tibble: 6 × 2
#>   sim_number bootstrap_samples
#>   <fct>      <list>           
#> 1 1          <dbl [25]>       
#> 2 2          <dbl [25]>       
#> 3 3          <dbl [25]>       
#> 4 4          <dbl [25]>       
#> 5 5          <dbl [25]>       
#> 6 6          <dbl [25]>
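
Because `bootstrap_samples` is a list column, a per-resample statistic can also be computed without unnesting. The sketch below does not call `tidy_bootstrap()`; it builds a small stand-in object with the same two columns in base R (`fake_boot` is a hypothetical name) and applies `vapply()` to the list column. The resample size of 25 mirrors the `<dbl [25]>` entries printed above for 32 observations and a proportion of 0.8; drawing with replacement is an assumption of this sketch.

```r
set.seed(123)

x <- mtcars$mpg
n_sims <- 5
m <- floor(0.8 * length(x))  # 25 values per resample, as in the printed output

# Stand-in with the same shape: a factor id plus a list column of resamples
fake_boot <- data.frame(sim_number = factor(seq_len(n_sims)))
fake_boot$bootstrap_samples <- lapply(
  seq_len(n_sims),
  function(i) sample(x, size = m, replace = TRUE)
)

# One statistic per resample, computed straight from the list column
fake_boot$mean <- vapply(fake_boot$bootstrap_samples, mean, numeric(1))

lengths(fake_boot$bootstrap_samples)
fake_boot[, c("sim_number", "mean")]
```

The same `vapply()` call should work on any list column of numeric vectors, which makes it a convenient alternative when only one statistic per resample is needed.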

Visualizing Bootstrap Distribution

# Line plot of the cumulative mean across simulations
bootstrap_stat_plot(bootstrap_data, .value = y, .stat = "cmean")

Line plot showing the cumulative mean of bootstrap samples over simulation numbers, demonstrating convergence of the bootstrap mean estimate

# Cumulative Harmonic Mean
bootstrap_stat_plot(bootstrap_data, .value = y, .stat = "chmean")

Line plot showing the cumulative harmonic mean of bootstrap samples over simulation numbers

# Show Groups
bootstrap_stat_plot(bootstrap_data, .value = y, .stat = "cmean",
                    .show_groups = TRUE)

Line plot showing cumulative mean with individual simulation groups displayed, illustrating the variability across bootstrap samples

Quick Statistics

# Summarise all resampled values pooled across simulations
summary(bootstrap_data |> bootstrap_unnest_tbl() |> pull(y))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   10.40   15.50   19.20   20.12   22.80   33.90

# Count simulations
length(unique(bootstrap_data$sim_number))
#> [1] 2000

Bootstrap Statistics

Unnesting Bootstrap Data

Use `bootstrap_unnest_tbl()` to expand the list column into one row per resampled value:

# Unnest the bootstrap data
unnested <- bootstrap_data |>
  bootstrap_unnest_tbl()

# Now you can calculate statistics
head(unnested)
#> # A tibble: 6 × 2
#>   sim_number     y
#>   <fct>      <dbl>
#> 1 1           15  
#> 2 1           10.4
#> 3 1           30.4
#> 4 1           15.2
#> 5 1           22.8
#> 6 1           19.2

Calculating Bootstrap Statistics

# Calculate statistics for each bootstrap sample
bootstrap_stats <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  group_by(sim_number) |>
  summarise(
    mean = mean(y),
    median = median(y),
    sd = sd(y),
    q25 = quantile(y, 0.25),
    q75 = quantile(y, 0.75)
  )

# View distribution of means
summary(bootstrap_stats$mean)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   16.47   19.34   20.09   20.12   20.91   24.79
hist(bootstrap_stats$mean, main = "Bootstrap Distribution of Mean",
     xlab = "Mean", col = "lightblue")

Histogram of bootstrap means showing the sampling distribution of the mean statistic across 2000 bootstrap samples
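
The spread and centre of such a distribution give two common summaries: the bootstrap standard error (the standard deviation of the per-resample statistic) and a bias estimate (the mean of the bootstrap statistic minus the statistic on the original data). A base R sketch for the median, assuming full-size resamples rather than the 80% resamples produced above, with illustrative names:

```r
set.seed(123)

x <- mtcars$mpg

# One median per resample
boot_medians <- replicate(
  2000,
  median(sample(x, size = length(x), replace = TRUE))
)

se_median   <- sd(boot_medians)                # bootstrap standard error
bias_median <- mean(boot_medians) - median(x)  # bootstrap bias estimate

c(se = se_median, bias = bias_median)
```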

Overall Bootstrap Statistics

# Calculate overall statistics from all bootstrap samples
overall_stats <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  summarise(
    mean_est = mean(y),
    sd_est = sd(y),
    median_est = median(y)
  )

overall_stats
#> # A tibble: 1 × 3
#>   mean_est sd_est median_est
#>      <dbl>  <dbl>      <dbl>
#> 1     20.1   5.94       19.2

Confidence Intervals

Bootstrap Percentile Method

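The percentile method is the most common and intuitive approach. Compute the statistic once per resample, then take quantiles of those values. Taking quantiles of the pooled resampled values instead would only describe the spread of the data, not the uncertainty of the estimate.

# Calculate a 95% percentile confidence interval for the mean
ci_level <- 0.95
alpha <- 1 - ci_level

bootstrap_ci <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  group_by(sim_number) |>
  summarise(boot_mean = mean(y)) |>
  summarise(
    lower_ci = unname(quantile(boot_mean, alpha/2)),
    upper_ci = unname(quantile(boot_mean, 1 - alpha/2)),
    point_estimate = mean(boot_mean)
  )

bootstrap_ci
#> # A tibble: 1 × 3
#>   lower_ci upper_ci point_estimate
#>      <dbl>    <dbl>          <dbl>
#> 1     17.7     22.6           20.1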

Confidence Intervals for Multiple Statistics

# Calculate CI for mean, median, and sd
ci_stats <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  group_by(sim_number) |>
  summarise(
    mean = mean(y),
    median = median(y),
    sd = sd(y)
  ) |>
  summarise(
    across(
      c(mean, median, sd),
      list(
        lower = ~ unname(quantile(.x, 0.025)),
        estimate = ~ unname(mean(.x)),
        upper = ~ unname(quantile(.x, 0.975))
      )
    )
  )

glimpse(ci_stats)
#> Rows: 1
#> Columns: 9
#> $ mean_lower      <dbl> 17.6837
#> $ mean_estimate   <dbl> 20.11925
#> $ mean_upper      <dbl> 22.5564
#> $ median_lower    <dbl> 16.4
#> $ median_estimate <dbl> 19.28125
#> $ median_upper    <dbl> 21.4025
#> $ sd_lower        <dbl> 4.104046
#> $ sd_estimate     <dbl> 5.879158
#> $ sd_upper        <dbl> 7.384673
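
The pipeline above hard-codes the 0.025 and 0.975 quantiles. When several coverage levels are needed, the same calculation can be wrapped in a small helper. `percentile_ci()` below is a hypothetical function written for this example, not part of TidyDensity, and the resamples are drawn with base R at full sample size.

```r
# Percentile interval for a vector of bootstrap statistics.
# `stat_values` holds one value per resample; `level` is the coverage.
percentile_ci <- function(stat_values, level = 0.95) {
  stopifnot(is.numeric(stat_values), level > 0, level < 1)
  alpha <- 1 - level
  bounds <- quantile(stat_values, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  c(lower = bounds[1], estimate = mean(stat_values), upper = bounds[2])
}

set.seed(123)
x <- mtcars$mpg

# One standard deviation per resample
boot_sd <- replicate(2000, sd(sample(x, size = length(x), replace = TRUE)))

percentile_ci(boot_sd, level = 0.95)
percentile_ci(boot_sd, level = 0.90)  # never wider than the 95% interval
```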

Visualizing Confidence Intervals

# Plot the bootstrap distribution of the mean with its 95% percentile CI
ci_bounds <- unname(quantile(bootstrap_stats$mean, c(0.025, 0.975)))

bootstrap_stats |>
  ggplot(aes(x = mean)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  geom_vline(xintercept = ci_bounds,
             linetype = "dashed", color = "red") +
  geom_vline(xintercept = mean(bootstrap_stats$mean),
             linetype = "solid", color = "darkblue", linewidth = 1) +
  labs(
    title = "Bootstrap Distribution of the Mean with 95% CI",
    x = "Bootstrap Mean",
    y = "Density"
  ) +
  theme_minimal()

Density plot of bootstrap distribution with vertical dashed red lines indicating the 2.5% and 97.5% percentiles for the 95% confidence interval, and a solid blue line showing the mean

Bootstrap Augmentation Functions

Augment Density

Add density calculations to bootstrap data:

# Augment with density information
augmented_density <- bootstrap_data |>
  bootstrap_density_augment()

head(augmented_density)
#> # A tibble: 6 × 5
#>   sim_number     x     y    dx        dy
#>   <fct>      <int> <dbl> <dbl>     <dbl>
#> 1 1              1  15    1.30 0.0000593
#> 2 1              2  10.4  3.04 0.000286 
#> 3 1              3  30.4  4.78 0.00102  
#> 4 1              4  15.2  6.51 0.00279  
#> 5 1              5  22.8  8.25 0.00624  
#> 6 1              6  19.2  9.99 0.0124

Augment Probability

Add cumulative probability:

# Augment with probability information
augmented_prob <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  bootstrap_p_augment(y)

head(augmented_prob)
#> # A tibble: 6 × 3
#>   sim_number     y      p
#>   <fct>      <dbl>  <dbl>
#> 1 1           15   0.186 
#> 2 1           10.4 0.0618
#> 3 1           30.4 0.937 
#> 4 1           15.2 0.249 
#> 5 1           22.8 0.779 
#> 6 1           19.2 0.530
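
The `p` values printed above look like empirical cumulative probabilities: 10.4 occurs twice among the 32 `mpg` values, and 2/32 = 0.0625 is close to the 0.0618 shown. As a point of comparison (an assumption about what the column represents, not a statement of how `bootstrap_p_augment()` is implemented), base R's `ecdf()` applied to the original data gives similar values:

```r
x <- mtcars$mpg

# Empirical CDF: proportion of observations less than or equal to a value
F_hat <- ecdf(x)

F_hat(c(10.4, 15, 19.2))
#> [1] 0.06250 0.18750 0.53125
```

The small differences from the printed `p` column are expected, since that column is computed on the resampled values rather than on the original 32 observations.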

Augment Quantile

Add quantile information:

# Augment with quantile information
augmented_quantile <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  bootstrap_q_augment(y)

head(augmented_quantile)
#> # A tibble: 6 × 3
#>   sim_number     y     q
#>   <fct>      <dbl> <dbl>
#> 1 1           15    10.4
#> 2 1           10.4  10.4
#> 3 1           30.4  10.4
#> 4 1           15.2  10.4
#> 5 1           22.8  10.4
#> 6 1           19.2  10.4

Advanced Bootstrap Techniques

Bootstrap for Difference of Means

# Two groups: automatic (am == 0) and manual (am == 1) transmission
group1 <- mtcars$mpg[mtcars$am == 0]
group2 <- mtcars$mpg[mtcars$am == 1]

# Bootstrap the difference in means (g2 minus g1),
# resampling each group independently
bootstrap_diff <- function(g1, g2, n_sims = 2000) {
  diffs <- numeric(n_sims)

  for (i in seq_len(n_sims)) {
    boot_g1 <- sample(g1, length(g1), replace = TRUE)
    boot_g2 <- sample(g2, length(g2), replace = TRUE)
    diffs[i] <- mean(boot_g2) - mean(boot_g1)
  }

  diffs
}

# Run bootstrap
diff_dist <- bootstrap_diff(group1, group2, 2000)

# Calculate CI
quantile(diff_dist, c(0.025, 0.975))
#>      2.5%     97.5% 
#>  3.827439 10.924939

# Visualize
hist(diff_dist, main = "Bootstrap Distribution of Difference in Means",
     xlab = "Difference in Means", breaks = 50, col = "lightgreen")
abline(v = quantile(diff_dist, c(0.025, 0.975)), 
       col = "red", lty = 2, lwd = 2)

Histogram showing the bootstrap distribution of the difference in means between automatic and manual transmission cars, with vertical red dashed lines indicating the 95% confidence interval

Bootstrap for Correlation

# Original correlation
cor_original <- cor(mtcars$mpg, mtcars$wt)

# Bootstrap correlations
boot_cor <- function(x, y, n_sims = 2000) {
  cors <- numeric(n_sims)
  n <- length(x)
  
  for (i in 1:n_sims) {
    indices <- sample(n, replace = TRUE)
    cors[i] <- cor(x[indices], y[indices])
  }
  
  return(cors)
}

# Run bootstrap
cor_dist <- boot_cor(mtcars$mpg, mtcars$wt, 2000)

# CI for correlation
cor_ci <- quantile(cor_dist, c(0.025, 0.975))
cat("95% CI for correlation:", cor_ci, "\n")
#> 95% CI for correlation: -0.9276411 -0.7921946

# Visualize
hist(cor_dist, main = "Bootstrap Distribution of Correlation",
     xlab = "Correlation Coefficient", breaks = 50, col = "lightyellow")
abline(v = cor_ci, col = "red", lty = 2, lwd = 2)
abline(v = cor_original, col = "blue", lwd = 2)

Histogram showing the bootstrap distribution of the correlation between mpg and weight, with vertical red dashed lines for the 95% confidence interval and a blue line for the original correlation estimate

Visualization

Multiple Visualizations

# Generate bootstrap data
boot_data <- tidy_bootstrap(mtcars$mpg, .num_sims = 2000) |>
  bootstrap_unnest_tbl()

# Create multiple plots
p1 <- ggplot(aes(x = y), data = boot_data) +
  geom_density(fill = "lightgreen", alpha = 0.5) +
  labs(title = "Density Plot", x = "Value", y = "Density") +
  theme_minimal()

p2 <- ggplot(aes(x = y), data = boot_data) +
  stat_ecdf(aes(x = y), geom = "step", color = "blue") +
  labs(title = "Probability Plot", x = "Value", y = "Cumulative Probability") +
  theme_minimal()

p3 <- ggplot(aes(x = seq_along(y), y = sort(y)), data = boot_data) +
  geom_point(color = "purple", alpha = 0.1) +
  labs(title = "Sorted Values", x = "Index", y = "Value") +
  theme_minimal()

p4 <- ggplot(aes(sample = y), data = boot_data) +
  stat_qq(color = "orange", alpha = 0.1) +
  stat_qq_line(color = "red") +
  labs(title = "QQ Plot", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

# Combine
(p1 | p2) / (p3 | p4)

Four-panel display showing density plot, probability (ECDF) plot, sorted values plot, and Q-Q plot for bootstrap samples, providing comprehensive visualization of bootstrap distribution characteristics

Custom Visualization with CI

# Compute one mean per resample, then its 2.5%, 50%, and 97.5% percentiles
boot_means <- boot_data |>
  group_by(sim_number) |>
  summarise(mean = mean(y))
ci <- unname(quantile(boot_means$mean, c(0.025, 0.5, 0.975)))

# Create plot
ggplot(boot_means, aes(x = mean)) +
  geom_density(fill = "skyblue", alpha = 0.5) +
  geom_vline(xintercept = ci[1], linetype = "dashed", color = "red") +
  geom_vline(xintercept = ci[2], linetype = "solid", color = "darkblue", 
             linewidth = 1.5) +
  geom_vline(xintercept = ci[3], linetype = "dashed", color = "red") +
  annotate("text", x = ci[1], y = 0, label = "2.5%", vjust = -1) +
  annotate("text", x = ci[2], y = 0, label = "Median", vjust = -1) +
  annotate("text", x = ci[3], y = 0, label = "97.5%", vjust = -1) +
  labs(
    title = "Bootstrap Distribution of the Mean with Confidence Interval",
    x = "Bootstrap Mean",
    y = "Density"
  ) +
  theme_minimal()

Density plot of bootstrap distribution with annotated 95% confidence interval showing lower bound at 2.5 percentile, median at 50th percentile, and upper bound at 97.5 percentile

Best Practices

1. Choose Appropriate Number of Simulations

General guidelines:

Exploratory: 1000 simulations
Standard analysis: 2000-5000 simulations
Publication/Critical: 10000+ simulations

2. Verify Bootstrap Convergence

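Check stability by repeating the analysis with different numbers of simulations. In the example below the bounds are percentiles of the pooled resampled values, so they coincide with the smallest and largest observations (10.4 and 33.9) at every setting; the informative column is the mean estimate. To assess convergence of a confidence interval for a statistic, compute the statistic per simulation before taking percentiles.

# Compare estimates across 500, 1000, and 2000 simulations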
n_sims_vec <- c(500, 1000, 2000)
convergence <- data.frame(
  n_sims = n_sims_vec,
  mean_est = NA,
  ci_lower = NA,
  ci_upper = NA
)

for (i in seq_along(n_sims_vec)) {
  boot <- tidy_bootstrap(data, .num_sims = n_sims_vec[i])
  stats <- boot |>
    bootstrap_unnest_tbl() |>
    summarise(
      mean = mean(y),
      lower = quantile(y, 0.025),
      upper = quantile(y, 0.975)
    )
  
  convergence[i, 2:4] <- as.numeric(stats)
}

convergence
#>   n_sims mean_est ci_lower ci_upper
#> 1    500 20.11131     10.4     33.9
#> 2   1000 20.06232     10.4     33.9
#> 3   2000 20.07139     10.4     33.9
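
A convergence check for an interval on the mean itself computes one mean per resample before taking percentiles. The base R sketch below does this with full-size resamples instead of `tidy_bootstrap()`; `convergence_mean` is a name chosen for this example.

```r
set.seed(123)

x <- mtcars$mpg
n_sims_vec <- c(500, 1000, 2000)

# For each number of resamples, compute per-resample means first,
# then take percentiles of those means
convergence_mean <- do.call(rbind, lapply(n_sims_vec, function(b) {
  boot_means <- replicate(b, mean(sample(x, size = length(x), replace = TRUE)))
  data.frame(
    n_sims   = b,
    mean_est = mean(boot_means),
    ci_lower = quantile(boot_means, 0.025, names = FALSE),
    ci_upper = quantile(boot_means, 0.975, names = FALSE)
  )
}))

convergence_mean
```

If the bounds still move noticeably between the last two rows, more simulations are probably needed.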

3. Consider Sample Size

Bootstrap reliability depends on sample size:

n <- length(data)

if (n < 20) {
  message("Small sample size (n < 20): bootstrap estimates may be unreliable.")
  message("Consider collecting at least n >= 30 observations before bootstrapping.")
}

cat("Sample size:", n, "\n")
#> Sample size: 32

4. Understand Limitations

Bootstrap works well for:

Means, medians, quantiles
Standard errors
Confidence intervals

Bootstrap may struggle with:

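Extreme order statistics (minimum, maximum)
Very small samples, where the data represent the population poorly
Dependent data (time series, clustered observations) unless the resampling scheme preserves the dependence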
Some complex statistics
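
The difficulty with extreme values can be seen directly: a resample can never exceed the observed maximum, and it reproduces that maximum exactly in a large share of resamples. A base R sketch with illustrative names, relying on the fact that the largest `mpg` value occurs only once:

```r
set.seed(123)

x <- mtcars$mpg
n <- length(x)

# Bootstrap the sample maximum
boot_max <- replicate(2000, max(sample(x, size = n, replace = TRUE)))

# Share of resamples whose maximum is exactly the observed maximum
prop_at_max <- mean(boot_max == max(x))

# Probability that the unique largest value is drawn at least once
p_theory <- 1 - (1 - 1 / n)^n

c(simulated = prop_at_max, theoretical = p_theory)
```

Roughly two thirds of the bootstrap maxima land on a single point, so the bootstrap distribution is a poor approximation to the sampling distribution of the maximum.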

Troubleshooting

Issue: CI Too Wide

Causes:

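High variability in data
Small sample size
Resample size below the original sample size (the default `.proportion = 0.8` draws 80% of the data per resample)

Solutions:

Check data variability with the coefficient of variation: `sd(data) / mean(data)`
Collect more data if possible
Increase the number of bootstrap samples to stabilise the interval endpoints; this reduces simulation noise but does not narrow the interval itself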
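
The effect of resample size on interval width can be checked with a short base R experiment. The sketch below compares the standard deviation of bootstrap means for full-size resamples and for resamples of 80% of the data; `boot_se_for_size()` is a hypothetical helper written for this example.

```r
set.seed(123)

x <- mtcars$mpg
n <- length(x)

# Standard deviation of bootstrap means for a given resample size m
boot_se_for_size <- function(m, n_boot = 5000) {
  sd(replicate(n_boot, mean(sample(x, size = m, replace = TRUE))))
}

se_full    <- boot_se_for_size(n)               # resamples of size n
se_partial <- boot_se_for_size(floor(0.8 * n))  # resamples of size 0.8 * n

c(full = se_full, partial = se_partial, ratio = se_partial / se_full)
```

The ratio should be close to `sqrt(32 / 25)`, about 1.13, so intervals built from 80% resamples are expected to be roughly 13% wider than those from full-size resamples.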

Issue: Long Computation Time