# Wrangling Data with R: A Guide to the tapply() Function

code
rtip
operations
Author

Steven P. Sanderson II, MPH

Published

March 25, 2024

# Introduction

Hey R enthusiasts! Today we’re diving into the world of data manipulation with a fantastic function called `tapply()`. This little gem lets you apply a function of your choice to different subgroups within your data.

Imagine you have a dataset on trees, with a column for tree height and another for species. You might want to know the average height for each species. `tapply()` comes to the rescue!

# Understanding the Syntax

Let’s break down the syntax of `tapply()`:

``tapply(X, INDEX, FUN, simplify = TRUE)``
• X: This is the vector or variable you want to perform the function on.
• INDEX: This is the factor variable that defines the groups. Each level in the factor acts as a subgroup for applying the function.
• FUN: This is the function you want to apply to each subgroup. It can be built-in functions like `mean()` or `sd()`, or even custom functions you write!
• simplify (optional): By default, `simplify = TRUE` (recommended for most cases). This returns a nice, condensed output that’s easy to work with. Setting it to `FALSE` gives you a more complex structure.

# Examples in Action

## Example 1: Average Tree Height by Species

Let’s say we have a data frame `trees` with columns “height” (numeric) and “species” (factor):

``````# Sample data
trees <- data.frame(height = c(20, 30, 25, 40, 15, 28),
species = c("Oak", "Oak", "Maple", "Pine", "Maple", "Pine"))

# Average height per species
average_height <- tapply(trees\$height, trees\$species, mean)
print(average_height)``````
``````Maple   Oak  Pine
20    25    34 ``````

This code calculates the average height for each species in the “species” column and stores the results in `average_height`. The output will be a named vector showing the average height for each unique species.

## Example 2: Exploring Distribution with Summary Statistics

We can use `tapply()` with `summary()` to get a quick overview of how a variable is distributed within groups. Here, we’ll see the distribution of height within each species:

``````summary_by_species <- tapply(trees\$height, trees\$species, summary)
print(summary_by_species)``````
``````\$Maple
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
15.0    17.5    20.0    20.0    22.5    25.0

\$Oak
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
20.0    22.5    25.0    25.0    27.5    30.0

\$Pine
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
28      31      34      34      37      40 ``````

This code applies the `summary()` function to each subgroup defined by the “species” factor. The output will be a data frame showing various summary statistics (like minimum, maximum, quartiles) for the height of each species.

## Example 3: Custom Function for Identifying Tall Trees

Let’s create a custom function to find trees that are taller than the average height of their species:

``````tall_trees <- function(height, avg_height) {
height > avg_height
}

# Find tall trees within each species
tall_trees_by_species <- tapply(trees\$height, trees\$species, mean(trees\$height),FUN=tall_trees)
print(tall_trees_by_species)``````
``````\$Maple
[1] FALSE FALSE

\$Oak
[1] FALSE  TRUE

\$Pine
[1] TRUE TRUE``````

Here, we define a function `tall_trees()` that takes a tree’s height and the average height (passed as arguments) and returns TRUE if the tree’s height is greater. We then use `tapply()` with this custom function. The crucial difference here is that we use `mean(trees\$height)` within the `FUN` argument to calculate the average height for each group outside of the custom function. This ensures the average height is calculated correctly for each subgroup before being compared to individual tree heights. The output will be a logical vector for each species, indicating which trees are taller than the average.

# Give it a Try!

This is just a taste of what `tapply()` can do. There are endless possibilities for grouping data and applying functions. Try it out on your own datasets! Here are some ideas:

• Calculate the median income for different age groups.
• Find the most frequent word used in emails sent by different departments.
• Group customers by purchase history and analyze their average spending.

Remember, R is all about exploration. So dive in, play with `tapply()`, and see what insights you can uncover from your data!