# Unleashing the Power of Sampling in R: Exploring the Versatile sample() Function

rtip
Author: Steven P. Sanderson II, MPH

Published: June 21, 2023

# Introduction

Sampling is a fundamental technique in data analysis and statistical modeling. It allows us to draw meaningful insights and make inferences about a larger population based on a representative subset. In the world of R programming, the `sample()` function stands as a versatile tool that enables us to create random samples efficiently. In this post, we will explore the `sample()` function and its various applications through a series of plain English examples.

First, let’s take a look at the syntax:

``sample(x, size, replace = FALSE, prob = NULL)``

where:

• `x` is the vector from which to take the sample; if `x` is a single positive number, `sample()` draws from `1:x` instead
• `size` is the number of elements to include in the sample (the default is `length(x)`)
• `replace` is a logical value that indicates whether or not to allow sampling with replacement (the default is FALSE)
• `prob` is a vector of probability weights, one per element of `x`; the weights need not sum to 1 (the default is NULL, which gives every element the same chance)
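The defaults described above can be sketched in a few lines of base R. The vector names and the seed are illustrative choices, not part of the original examples:

```r
set.seed(123)  # arbitrary seed so the draws repeat between runs

# size defaults to length(x), so this returns a random permutation of x
letters_vec <- c("a", "b", "c", "d", "e")
shuffled <- sample(letters_vec)

# A single positive number n is treated as 1:n, not as a one-element vector
from_one_to_ten <- sample(10, size = 3)

# Indexing through sample.int() avoids that surprise for one-element vectors
single <- 10
safe_draw <- single[sample.int(length(single), size = 1)]

print(shuffled)
print(from_one_to_ten)
print(safe_draw)
```

The `1:x` behavior is convenient interactively, but it can be surprising when `x` is built programmatically and happens to have length one, which is why the last draw goes through an index.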

# Examples

## Example 1: Simple Random Sampling

Let’s say we have a dataset containing the ages of 100 people; to keep things simple, the integers 1 to 100 stand in for the ages here. To create a random sample of 10 individuals, we can use the `sample()` function as follows:

```r
ages <- 1:100
random_sample <- sample(ages, size = 10)
random_sample
```
`` [1] 53 13 84 50 55  9 12 38 79 15``

The `sample()` function randomly selects 10 values from the `ages` vector, without replacement, resulting in a new vector named `random_sample`. This technique represents simple random sampling, where each individual in the population has an equal chance of being included in the sample. Because the draw is random, your output will differ from the one shown here and will change from run to run.
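If you need the same sample every time, one common approach is to fix the random seed with `set.seed()` before sampling. The sketch below reuses the `ages` vector; the seed value 42 is an arbitrary assumption:

```r
ages <- 1:100

set.seed(42)  # 42 is an arbitrary choice
first_draw <- sample(ages, size = 10)

set.seed(42)  # resetting the same seed replays the same random stream
second_draw <- sample(ages, size = 10)

# Both draws contain the same values in the same order
identical(first_draw, second_draw)
```

Setting the seed once at the top of a script is usually enough to make every later call to `sample()` repeatable.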

## Example 2: Sampling with Replacement

In some scenarios, we might want to allow repeated selections from the population. Let’s say we have a bag with colored balls, and we want to simulate drawing 5 balls with replacement. Here’s how we can achieve it:

```r
colors <- c("red", "blue", "green", "yellow")
sample_with_replacement <- sample(colors, size = 5, replace = TRUE)
sample_with_replacement
```
``[1] "yellow" "yellow" "green"  "green"  "red"   ``

The `sample()` function, with the `replace = TRUE` argument, enables us to randomly select 5 colors from the `colors` vector, allowing duplicates. This approach represents sampling with replacement, where each selection is independent of the previous ones.
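One practical consequence is that `replace = TRUE` lets the sample be larger than the population, while the default `replace = FALSE` does not. A minimal sketch, where the size of 10 is an arbitrary choice:

```r
colors <- c("red", "blue", "green", "yellow")

# With replacement, the sample can be larger than the population
big_sample <- sample(colors, size = 10, replace = TRUE)
table(big_sample)

# Without replacement, asking for more values than length(colors) is an error;
# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  sample(colors, size = 10),
  error = function(e) conditionMessage(e)
)
result
```

Here `result` holds the error message rather than a sample, which shows that the request was rejected.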

## Example 3: Weighted Sampling

In certain situations, we may want to assign different probabilities to elements in the population. Let’s assume we have a vector of items and a matching vector of weights giving each item’s probability of being selected. We can use the `sample()` function with the `prob` parameter to achieve weighted sampling. Consider the following example:

```r
library(dplyr)

items <- c("apple", "banana", "orange")
weights <- c(0.4, 0.2, 0.4)
weighted_sample <- sample(items, size = 1, prob = weights)
weighted_sample
```
``[1] "apple"``
```r
tibble(x = 1:10) |>
  group_by(x) |>
  mutate(rs = sample(items, size = 1, prob = weights)) |>
  ungroup()
```
```
# A tibble: 10 × 2
       x rs
   <int> <chr>
 1     1 orange
 2     2 apple
 3     3 apple
 4     4 apple
 5     5 apple
 6     6 orange
 7     7 orange
 8     8 orange
 9     9 apple
10    10 orange
```

By specifying the `prob` argument with the corresponding weights, the `sample()` function randomly selects a single item from the `items` vector. The probability of each item being chosen is proportional to its weight. In this case, “apple” and “orange” have a higher chance (40% each) of being selected compared to “banana” (20%). The second snippet groups the tibble by `x` so that `sample()` runs once per row, giving every row its own independent weighted draw.
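With only a handful of draws it is hard to see the weights at work. A rough way to check them is to draw many times and compare the observed shares with the weights; the seed and the 10,000 draws below are arbitrary assumptions:

```r
set.seed(2023)  # arbitrary seed for repeatability

items <- c("apple", "banana", "orange")
weights <- c(0.4, 0.2, 0.4)

# Draw many items at once; replace = TRUE is required because size > length(items)
draws <- sample(items, size = 10000, replace = TRUE, prob = weights)

# Observed share of each item, which should sit close to the weights
observed <- prop.table(table(draws))
round(observed, 3)

# Weights need not sum to 1, so c(2, 1, 2) describes the same distribution
rescaled_draws <- sample(items, size = 10000, replace = TRUE, prob = c(2, 1, 2))
round(prop.table(table(rescaled_draws)), 3)
```

Both tables should show shares near 0.4, 0.2, and 0.4, with small differences caused by sampling noise.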

## Example 4: Stratified Sampling

Stratified sampling involves dividing the population into subgroups, or strata, and then sampling from each stratum separately, often in proportion to its size. Let’s assume we have a dataset of students’ grades in different subjects, and we want to select a sample that maintains the proportion of students from each subject. We can achieve this by combining the `sample()` function with `by()`, which applies a function to each group. Consider the following example:

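```r
subjects <- c("Math", "Science", "English", "History")
grades <- c(80, 90, 85, 70, 75, 95, 60, 92, 88, 83, 78, 91)
# Assumption: the subjects repeat in order across the 12 grades (three per subject)
strata <- factor(rep(subjects, length.out = length(grades)))
stratified_sample <- unlist(
  by(grades, strata, FUN = function(x) sample(x, size = 2))
)
stratified_sample
```

The names of the result identify each stratum and draw:

``English1 English2 History1 History2    Math1    Math2 Science1 Science2``

In this example, we use the `by()` function to group the grades by subject (`strata`). `by()` needs one stratum label per grade, so the four subject names are repeated across the 12 grades; the pairing of grades with subjects shown here is an assumption made for illustration. We then apply the `sample()` function to each subgroup (subject) through the `FUN` argument. The result is a stratified sample of two grades from each subject. Because every subject here has the same number of grades, taking two from each keeps the subjects’ relative proportions in the final sample.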
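When the strata differ in size, a fixed number per stratum no longer preserves their proportions. One possible approach is to draw the same fraction from every stratum instead. The sketch below uses hypothetical data: the `students` data frame, its columns, and the 50% sampling fraction are all assumptions made for illustration:

```r
set.seed(1)  # arbitrary seed for repeatability

# Hypothetical data: 20 students spread unevenly across three subjects
students <- data.frame(
  id = 1:20,
  subject = rep(c("Math", "Science", "English"), times = c(10, 6, 4))
)

sampling_fraction <- 0.5  # assumed fraction to draw from every stratum

# Split row indices by subject, then sample the same fraction within each stratum
rows_by_subject <- split(seq_len(nrow(students)), students$subject)
picked_rows <- unlist(lapply(rows_by_subject, function(rows) {
  n_take <- round(length(rows) * sampling_fraction)
  rows[sample.int(length(rows), size = n_take)]
}))

stratified <- students[picked_rows, ]

# Each subject contributes half of its students: 5 Math, 3 Science, 2 English
table(stratified$subject)
```

Sampling row positions with `sample.int()` rather than the rows themselves keeps the code safe when a stratum contains only one row.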