Introduction

Random sampling is a fundamental technique in statistics, simulation, and data analysis. Whether you are building a model, testing a hypothesis, or simulating data, learning how to randomly select samples from your dataset is a must. In R, the built-in sample() function is an easy and powerful way to obtain random samples from vectors, data.frames, and even matrices.

In this article, we will explain the sample() function in detail, provide working examples, and show you how to perform both sampling with and without replacement. By the end, you will be able to confidently use random sampling to support your data analysis tasks in R.

Random sampling is useful for many tasks. With random samples, you can:

Test hypotheses: Evaluate if a sample represents the population.
Split data: Create training and test sets for machine learning models.
Bootstrap samples: Resample your data to estimate uncertainty.
Shuffle data: Randomize the order of data for simulation studies.

In R, the sample() function is a versatile tool that lets you randomly draw items from a collection—whether that collection is a simple vector, a data.frame, or even a matrix. In the following sections, we will explain the syntax of sample(), show examples with and without replacement, and provide sample code for various data structures.

Understanding the `sample()` Function

The basic syntax of the sample() function in R is as follows:

sample(x, size, replace = FALSE, prob = NULL)

Let’s break down the arguments:

x: The input vector (or sometimes more complex data structures) from which to sample.
size: The number of items you want to pick.
replace: A logical value indicating whether sampling is with replacement (set to TRUE) or without (the default value FALSE).
prob: An optional vector of probability weights for performing weighted sampling.

This function works by randomly shuffling the elements of x when size is not specified. When you set the size argument, sample() returns a random subset of the elements from x.

For more examples and detailed explanations on sample(), many great resources are available (https://www.statology.org/random-sample-in-r/) (https://www.r-bloggers.com/2024/03/mastering-random-sampling-in-r-with-the-sample-function/).

Sampling from Vectors

Vectors are the simplest data structure in R. Let’s start with a few examples that show how to use the sample() function to draw random samples from a vector.

Simple Random Sampling (Without Replacement)

In simple random sampling, each element is only selected once. Here’s how you can sample 5 elements from a vector of numbers without replacement:

# Create a vector with numbers 1 through 10
numbers <- 1:10

# Draw a random sample of 5 elements without replacement
sample_without_replacement <- sample(numbers, 5)

# Print the sampled values
print(sample_without_replacement)

[1]  7  9 10  4  5

Every time you run this script, you will see a different order for the five unique elements chosen from 1 to 10. This is because sampling without replacement means no element is repeated (https://www.r-bloggers.com/2024/03/mastering-random-sampling-in-r-with-the-sample-function/).

Random Sampling with Replacement

When sampling with replacement, the same element can be selected more than once. This is useful if you need to simulate scenarios where an observation might appear multiple times. To sample with replacement, simply set replace = TRUE:

# Draw a sample of 5 elements with replacement from the same vector
sample_with_replacement <- sample(numbers, 5, replace = TRUE)

# Print the sampled values
print(sample_with_replacement)

[1]  2  3  2 10 10

Because replacement is allowed, you might see the same number appear more than once (for instance, you might get 3 3 7 2 9). This method is also commonly used in bootstrapping methods .

Sampling from Data Frames

Often you need to randomly select rows from a data.frame instead of just sampling from a vector of numbers. This is very useful when splitting data into training and testing sets, or when you need a subset for exploratory analysis.

Imagine you have the following data.frame:

# Create a simple data.frame with names and ages
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  Age = c(25, 30, 35, 28, 22)
)

# Randomly select 3 rows from the data.frame without replacement
df_sample <- df[sample(nrow(df), 3), ]

# Print the sampled data.frame
print(df_sample)

   Name Age
4 Diana  28
5   Eve  22
2   Bob  30

Explanation:

nrow(df): This function inside sample() returns the total number of rows in the data.frame.
sample(nrow(df), 3): This random function selects 3 unique row numbers from the total available.
df[ … , ]: We then subset the original data.frame using the randomly chosen row numbers.

This will return a new data.frame with 3 randomly selected rows, which might be useful for quick exploratory analysis or as input for further processing. Sampling rows using this technique is common when the dataset is large and you need to quickly check a random subset .

Sampling from a Matrix

Matrices in R are two-dimensional arrays, and you can also use the sample() function to work with them. The following examples demonstrate two common approaches to sampling from a matrix: sampling random elements from the entire matrix and sampling random rows.

Sampling Random Elements from a Matrix

You might want to pick random elements from a matrix regardless of rows and columns. For example:

# Create a 3x3 matrix with numbers from 1 to 9
matrix_data <- matrix(1:9, nrow = 3, byrow = TRUE)
print(matrix_data)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

# Sample 4 random elements from the matrix (note: the matrix is treated as a vector)
random_elements <- sample(matrix_data, 4)

# Print the sampled elements
print(random_elements)

[1] 8 2 5 6

Explanation: - R internally treats the matrix as a vector when passed to sample(). Hence, the function picks 4 random values from all values in the matrix.

Sampling Rows of a Matrix

If you need to randomly select rows (maintaining the matrix structure), you can do this by sampling the row indices:

# Sample 2 random rows from the matrix
random_rows <- matrix_data[sample(nrow(matrix_data), 2), ]

# Print the sampled rows, which still keep the matrix-like structure
print(random_rows)

     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    1    2    3

Explanation:

nrow(matrix_data): Returns the number of rows in the matrix.
sample(nrow(matrix_data), 2): Randomly selects 2 row indices.
matrix_data[ … , ]: Subsets the matrix by these rows and returns a matrix.

This technique is particularly useful when dealing with multivariate data stored as a matrix and you wish to preserve entire rows for subsequent analysis .

Weighted Random Sampling

Sometimes, you need elements to have a higher chance of being selected. This is where the prob argument comes into play. For example, let’s say you have a vector representing four options and they should not all have the same chance of appearing in the sample:

# Define a vector representing four different items
items <- c("Apple", "Banana", "Cherry", "Date")

# Define the weights so that "Date" has the highest probability of selection
weights <- c(0.1, 0.2, 0.3, 0.4)

# Sample 3 elements from items using the weights (without replacement)
weighted_sample <- sample(items, 3, replace = FALSE, prob = weights)

# Print the weighted sample
print(weighted_sample)

[1] "Date"   "Cherry" "Apple"

Explanation:

The prob parameter assigns selection probabilities. In this example, “Date” (with the highest weight of 0.4) is more likely to be picked.
Using weighted sampling can be very useful, for example, when you simulate real-world scenarios where some events occur more frequently than others.

Using `set.seed()` for Reproducible Results

In random sampling, you might want to generate the same random output each time you run your code—especially when sharing code with colleagues or including examples in your reports. R’s set.seed() function lets you do exactly that.

For example:

# Set the seed to ensure reproducibility
set.seed(42)

# Sample 5 numbers without replacement from 1 to 10
reproducible_sample <- sample(1:10, 5)
print(reproducible_sample)

[1]  1  5 10  8  2

# Re-run with the same seed to see the same sample
set.seed(42)
reproducible_sample <- sample(1:10, 5)
print(reproducible_sample)

[1]  1  5 10  8  2

Using set.seed() guarantees that the random sequence is the same in every run, which is important for debugging and sharing reproducible research .

Your Turn

Now that we have covered the basics and more advanced examples of using the sample() function, it’s time for you to practice! Here are some exercises to try on your own:

Click to see solution.

Exercise 1:
Generate a random sample of 10 elements from the letters of the English alphabet without replacement.

Hint: Use letters (a built-in vector in R) and the sample() function.

# Answer:
sample_letters <- sample(letters, 10, replace = FALSE)
print(sample_letters)

 [1] "d" "r" "q" "o" "g" "z" "e" "n" "y" "w"

Exercise 2:
Sample 5 elements with replacement from the vector c(10, 20, 30, 40, 50).

# Answer:
sample_numbers <- sample(c(10, 20, 30, 40, 50), 5, replace = TRUE)
print(sample_numbers)

[1] 30 10 10 30 40

Exercise 3:
Create a vector of weights and perform weighted random sampling to select 3 elements from the vector c("apple", "banana", "orange", "grape").
Make sure that “orange” has the highest probability of being selected.

# Answer:
fruit_items <- c("apple", "banana", "orange", "grape")
fruit_weights <- c(0.2, 0.2, 0.4, 0.2)
weighted_fruit_sample <- sample(fruit_items, 3, replace = FALSE, prob = fruit_weights)
print(weighted_fruit_sample)

[1] "orange" "apple"  "banana"

Experiment with these exercises by changing the parameters. This will help solidify your understanding of random sampling in R.

Key Takeaways

sample() Function:
The sample() function in R is robust for drawing random elements from vectors, data.frames, and even matrices.
With vs. Without Replacement:
– Use replace = FALSE for unique sampling.
– Use replace = TRUE when you allow repeated values in the sample.
Working with Complex Data Structures:
You can sample rows from data.frames or from entire matrices by using functions such as nrow() to index your data.
Weighted Sampling:
The prob argument allows you to specify weights for elements, making some more likely to be sampled than others.
Reproducibility:
Use set.seed() to ensure that your random samples are the same across multiple runs, which is critical for reproducible research.

These points will help guide your use of random sampling.

Frequently Asked Questions (FAQs)

Q1: What does sampling with replacement mean?
A: Sampling with replacement means that when you choose an element, it is “put back” into the pool of values. This allows the same element to be selected more than once. For example, using sample(numbers, 5, replace = TRUE) might select one number twice while missing another .

Q2: How is weighted random sampling useful?
A: Weighted random sampling allows you to assign different probabilities to each element in your vector. This is useful in simulations where certain outcomes are more likely than others. By using the prob argument, you can simulate more realistic scenarios where not all elements have an equal chance of being selected.

Q3: Can I use the sample() function on data.frames?
A: Yes, you can. By sampling the row indices using sample(nrow(your_dataframe), size), you can randomly select rows from a data.frame. This method is especially useful for creating training and testing sets.

Q4: How do I ensure that my random sampling is reproducible?
A: Use the set.seed() function at the start of your script. Setting a seed (e.g., set.seed(42)) ensures that the sequence of random numbers—and thus your samples—is the same each time you run your code.

Q5: What if I try to sample more elements than are available in my vector?
A: If you attempt to sample without replacement more elements than exist in the vector, R will return an error. To prevent this, either ensure that the requested size does not exceed the length of the vector or set replace = TRUE if duplicates are acceptable.

Conclusion

Random sampling is an essential skill for any R programmer. Whether you’re working with simple vectors, data.frames, or matrices, the sample() function allows you to extract random subsets of your data with ease. In this article, we covered how to use sample() for both non-repetitive selection (without replacement) and for allowing repeated values (with replacement). We also touched on weighted sampling—useful when some elements should be more likely to appear—and demonstrated how to achieve reproducibility using set.seed().

Make sure to experiment with the different options provided by the sample() function as part of your workflow.

If you found this article helpful, please feel free to comment below, share it on your favorite social media channels, or subscribe for more R programming tutorials.