How to Calculate Percentage by Group in R using Base R, dplyr, and data.table

rtip
Author

Steven P. Sanderson II, MPH

Published

July 24, 2023

Introduction

Calculating percentages by group is a common task in data analysis. It allows you to understand the distribution of data within different categories. In this blog post, we’ll walk you through the process of calculating percentages by group using three popular R packages: Base R, dplyr, and data.table. To keep things simple, we will use the well-known Iris dataset.

The Iris dataset contains information about different species of iris flowers and their measurements, including sepal length, sepal width, petal length, and petal width. We will focus on the ‘Species’ column and calculate the percentage of each species in the dataset.

Examples

Example 1: Using Base R

Step 1: Load the Iris dataset

# Load the Iris dataset
data(iris)

Step 2: Calculate the counts by group

# Use the table() function to get the counts of each species
group_counts <- table(iris$Species)

Step 3: Calculate the total count

# Calculate the total count using the sum() function
total_count <- sum(group_counts)

Step 4: Calculate the percentage by group

# Divide each count by the total count and multiply by 100 to get the percentage
percentage_by_group <- (group_counts / total_count) * 100

Step 5: Combine group names and percentages into a data frame and display the result

# Combine group names and percentages into a data frame
result_base_R <- data.frame(
  Species = names(percentage_by_group), 
  Percentage = percentage_by_group
  )

# Print the result
print(result_base_R)
     Species Percentage.Var1 Percentage.Freq
1     setosa          setosa        33.33333
2 versicolor      versicolor        33.33333
3  virginica       virginica        33.33333

Example 2: Using dplyr

Step 1: Load the necessary library and the Iris dataset

# Load the dplyr library
library(dplyr)

# Load the Iris dataset
data(iris)

Step 2: Calculate the percentage by group using dplyr

# Use the group_by() and summarise() functions to calculate percentages
result_dplyr <- iris %>%
  group_by(Species) %>%
  summarise(Percentage = n() / nrow(iris) * 100)

Step 3: Display the result

# Print the result
print(result_dplyr)
# A tibble: 3 × 2
  Species    Percentage
  <fct>           <dbl>
1 setosa           33.3
2 versicolor       33.3
3 virginica        33.3

Example 3: Using data.table:

Step 1: Load the necessary library and the Iris dataset

# Load the data.table library
library(data.table)

# Convert the Iris dataset to a data.table
iris_dt <- as.data.table(iris)

Step 2: Calculate the percentage by group using data.table

# Use the .N special symbol to calculate counts and by-reference to save memory
result_data_table <- iris_dt[, .(Percentage = .N / nrow(iris_dt) * 100), by = Species]

Step 3: Display the result

# Print the result
print(result_data_table)
      Species Percentage
1:     setosa   33.33333
2: versicolor   33.33333
3:  virginica   33.33333

Conclusion

In this blog post, we demonstrated three methods to calculate percentages by group in R using Base R, dplyr, and data.table. Each method has its advantages, and you can choose the one that suits your needs and preferences. The key takeaway is that understanding the distribution of data within groups can provide valuable insights in data analysis. We encourage you to try these methods on your own datasets and explore further possibilities with these powerful R packages. Happy coding!