# Summarizing Data in R: tapply() vs. group_by() and summarize()

Category: rtip

Author: Steven P. Sanderson II, MPH

Published: July 26, 2023

# Introduction

Are you tired of manually calculating summary statistics for your data in R? Look no further! In this blog post, we will explore two powerful ways to summarize data: using the `tapply()` function and the `group_by()` and `summarize()` functions from the `dplyr` package. Both methods are incredibly useful and can save you time and effort in your data analysis projects.

# Using the tapply() Function

The `tapply()` function in R allows you to apply a function to subsets of a vector or array, split by one or more factors. It’s a fundamental tool for aggregating data in R. The basic syntax for `tapply()` is as follows:

``tapply(data, INDEX, FUN, ...)``

• `data`: The vector or array you want to summarize (this argument is named `X` in R's documentation).
• `INDEX`: A factor, or a list of factors, used to split the data into groups.
• `FUN`: The function you want to apply to each subset.
• `...`: Additional arguments that you want to pass to `FUN`.
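
As a minimal sketch of how the `...` argument works, the example below passes `na.rm = TRUE` through `tapply()` to `mean()`. The vectors `heights` and `groups` are made up for illustration and are not part of the examples in this post.

```r
# Hypothetical data with one missing value (illustration only)
heights <- c(150, 160, NA, 170, 180)
groups  <- c("x", "x", "y", "y", "y")

# Without na.rm, a group that contains an NA gets an NA mean
with_na <- tapply(heights, groups, mean)

# Arguments after FUN are forwarded to it, so this calls mean(..., na.rm = TRUE)
without_na <- tapply(heights, groups, mean, na.rm = TRUE)

print(with_na)
print(without_na)
```

The same forwarding works for any extra argument the summary function accepts, such as `probs` for `quantile()`.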

## Example 1: Summarizing a Numeric Vector with tapply()

Suppose you have a dataset with students’ exam scores and their corresponding grades. You want to calculate the average score for each grade.

```r
# Sample data
scores <- c(85, 90, 78, 92, 88, 76, 84, 92, 95, 89)
grades <- c("A", "A", "B", "A", "B", "C", "B", "A", "A", "B")

# Using tapply() to calculate the average score for each grade
avg_scores <- tapply(scores, grades, mean)

print(avg_scores)
```

```
    A     B     C 
90.80 84.75 76.00 
```

Or using the built-in `iris` dataset:

```r
mean_width_by_species <- tapply(iris$Sepal.Width, iris$Species, mean)

print(mean_width_by_species)
```

```
    setosa versicolor  virginica 
     3.428      2.770      2.974 
```

In the first example, `tapply()` splits the `scores` vector by the values in the `grades` vector and calculates the average score for each grade. The `iris` example works the same way, splitting `Sepal.Width` by `Species`.

# Using the group_by() and summarize() Functions from dplyr

The `dplyr` package is a powerful tool for data manipulation in R. It provides the `group_by()` function to group data based on specific variables and the `summarize()` function to calculate summary statistics for each group.

## Example 2: Summarizing a Data Frame with group_by() and summarize()

Suppose you have a dataset with information about employees, including their department, salary, and years of experience. You want to find the average salary and the maximum years of experience for each department.

The `group_by()` and `summarize()` functions from the `dplyr` package provide a concise way to summarize data like this. The general pattern is as follows:

```r
data %>%
  group_by(INDEX) %>%
  summarize(FUN(...))
```

Where:

• `data` is the data frame that you want to summarize.
• `INDEX` is the column (or columns) that you want to group by.
• `FUN` is the summary function that you want to apply to each group.
• `...` are the arguments passed to `FUN`, typically the column to summarize plus any additional options.

```r
# Assuming you have already installed the 'dplyr' package
library(dplyr)

# Sample data frame
employees <- data.frame(
  department = c("HR", "Engineering", "HR", "Engineering", "Marketing", "Marketing"),
  salary = c(50000, 65000, 48000, 70000, 55000, 60000),
  experience = c(3, 5, 2, 7, 4, 6)
)

# Using group_by() and summarize() to calculate average salary
# and max experience by department
summary_data <- employees %>%
  group_by(department) %>%
  summarize(
    avg_salary = mean(salary),
    max_experience = max(experience)
  )

print(summary_data)
```

```
# A tibble: 3 × 3
  department  avg_salary max_experience
  <chr>            <dbl>          <dbl>
1 Engineering      67500              7
2 HR               49000              3
3 Marketing        57500              6
```

The `group_by()` function groups the data by the `department` variable, and then `summarize()` calculates the average salary and maximum years of experience for each group.
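
For comparison, here is a rough sketch of the same summary in base R. It rebuilds the `employees` data frame from Example 2 so that it runs on its own. Because `tapply()` applies a single function per call, each statistic needs its own call; the object name `base_summary` is chosen for illustration only.

```r
# Same sample data as in Example 2
employees <- data.frame(
  department = c("HR", "Engineering", "HR", "Engineering", "Marketing", "Marketing"),
  salary = c(50000, 65000, 48000, 70000, 55000, 60000),
  experience = c(3, 5, 2, 7, 4, 6)
)

# One tapply() call per summary statistic
avg_salary <- tapply(employees$salary, employees$department, mean)
max_experience <- tapply(employees$experience, employees$department, max)

# Both results are ordered by the same department levels,
# so they can be combined column by column
base_summary <- data.frame(
  department = names(avg_salary),
  avg_salary = as.numeric(avg_salary),
  max_experience = as.numeric(max_experience)
)

print(base_summary)
```

The values should match the tibble above; the difference is that `summarize()` computes both columns in one step.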

Now let’s put the two approaches side by side and see that they produce the same results on the `iris` dataset:

```r
tapply(iris$Sepal.Width, iris$Species, mean)
```

```
    setosa versicolor  virginica 
     3.428      2.770      2.974 
```

```r
iris %>%
  group_by(Species) %>%
  summarize(mean_width = mean(Sepal.Width))
```

```
# A tibble: 3 × 2
  Species    mean_width
  <fct>           <dbl>
1 setosa           3.43
2 versicolor       2.77
3 virginica        2.97
```
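
The tibble prints rounded values, so one way to confirm that the two approaches really agree is to compare the numbers directly. The sketch below assumes `dplyr` is installed; the object names are illustrative. `tapply()` returns a named array while `summarize()` returns a tibble, so both are reduced to plain numeric vectors before comparing.

```r
library(dplyr)

# Base R result: a named array, one element per species
base_means <- tapply(iris$Sepal.Width, iris$Species, mean)

# dplyr result: a tibble with one row per species
dplyr_means <- iris %>%
  group_by(Species) %>%
  summarize(mean_width = mean(Sepal.Width))

# Both results follow the level order of Species, so positions line up
same_values <- all.equal(as.numeric(base_means), dplyr_means$mean_width)
print(same_values)
```

If the two methods agree, `all.equal()` returns `TRUE`.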

# Which method should you use?

The `tapply()` function is part of base R, so it needs no extra packages, and it can apply any function to a vector grouped by one or more factors. The `group_by()` and `summarize()` functions, on the other hand, are usually more concise and easier to read, especially when you compute several statistics at once.

In general, I would recommend using the `group_by()` and `summarize()` functions if your data is in a data frame and you are calculating summary statistics. If you are working with plain vectors, or if you group by multiple variables and want the result as a matrix or array with one dimension per variable, then the `tapply()` function may be a better choice.
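
To make the multiple-variable case concrete, here is a small sketch using the built-in `mtcars` dataset, which does not appear elsewhere in this post. Both approaches accept more than one grouping variable; what differs is the shape of the result. The object names are illustrative, and the `dplyr` part assumes the package is installed.

```r
library(dplyr)

# tapply(): pass a list of factors; the result is a matrix
# with cyl in the rows and am in the columns
mpg_matrix <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)
print(mpg_matrix)

# dplyr: name several columns in group_by(); the result has
# one row per combination of cyl and am
mpg_long <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")
print(mpg_long)
```

The matrix form is handy for reading values off a cross-tabulation, while the long form is easier to filter, join, or plot.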

# Encouragement

Summarizing data is an essential skill in data analysis, and using the `tapply()` function and the `group_by()` and `summarize()` functions from `dplyr` can significantly simplify your workflow. I encourage you to experiment with your own datasets and try different summary functions (e.g., `median()`, `sd()`, etc.) to gain deeper insights into your data.
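
As a starting point for that kind of experimentation, the sketch below swaps in `median()` and `sd()` on the `iris` data used earlier. The object names are illustrative, and the `dplyr` part assumes the package is installed.

```r
library(dplyr)

# Several statistics in a single summarize() call
width_stats <- iris %>%
  group_by(Species) %>%
  summarize(
    median_width = median(Sepal.Width),
    sd_width = sd(Sepal.Width)
  )
print(width_stats)

# The same statistics with tapply(), one call per function
median_by_species <- tapply(iris$Sepal.Width, iris$Species, median)
sd_by_species <- tapply(iris$Sepal.Width, iris$Species, sd)
print(median_by_species)
print(sd_by_species)
```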

Feel free to explore other functions and packages in R that offer powerful data manipulation and summarization capabilities. R provides a vast ecosystem of packages to make your data analysis journey even more enjoyable. Happy coding!