Simplifying Data Transformation with pivot_longer() in R’s tidyr Library

rtip
tidyr
Author

Steven P. Sanderson II, MPH

Published

June 6, 2023

Introduction

In the world of data analysis and manipulation, tidying and reshaping data is often an essential step. R’s tidyr library provides powerful tools to efficiently transform and reshape data. One such function is pivot_longer(). In this blog post, we’ll explore how pivot_longer() works and demonstrate its usage through several examples. By the end, you’ll have a solid understanding of how to use this function to make your data more manageable and insightful.

The tidyr library holds the function, so we are going to have to load it first.

library(tidyr)

Understanding pivot_longer()

The pivot_longer() function is designed to reshape data from a wider format to a longer format. It takes columns that represent different variables and consolidates them into key-value pairs, making it easier to analyze and visualize the data.

Syntax: The basic syntax of pivot_longer() is as follows:

pivot_longer(data, cols, names_to, values_to)
  • data: The data frame or tibble to be reshaped.
  • cols: The columns to be transformed.
  • names_to: The name of the new column that will hold the variable names.
  • values_to: The name of the new column that will hold the corresponding values.

Example 1: Reshaping Wide Data to Long Data

Let’s start with a simple example to demonstrate the usage of pivot_longer(). Suppose we have a data frame called students with columns representing subjects and their respective scores:

students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math = c(90, 85, 92),
  science = c(95, 88, 91),
  history = c(87, 92, 78)
)

To reshape this data from a wider format to a longer format, we can use pivot_longer() as follows:

students_long <- pivot_longer(
  students, 
  cols = -name, 
  names_to = "subject", 
  values_to = "score"
  )

students_long
# A tibble: 9 × 3
  name    subject score
  <chr>   <chr>   <dbl>
1 Alice   math       90
2 Alice   science    95
3 Alice   history    87
4 Bob     math       85
5 Bob     science    88
6 Bob     history    92
7 Charlie math       92
8 Charlie science    91
9 Charlie history    78

The resulting students_long data frame will have three columns: name, subject, and score, where each row represents a student’s score in a specific subject.

Example 2: Handling Multiple Variables In many cases, data frames contain multiple variables that need to be pivoted simultaneously. Consider a data frame called sales with columns representing sales figures for different products in different regions:

sales <- data.frame(
  region = c("North", "South", "East"),
  product_A = c(100, 120, 150),
  product_B = c(80, 90, 110),
  product_C = c(60, 70, 80)
)

To reshape this data, we can specify multiple columns to pivot using pivot_longer():

sales_long <- pivot_longer(
  sales, 
  cols = starts_with("product"), 
  names_to = "product", 
  values_to = "sales"
  )

sales_long
# A tibble: 9 × 3
  region product   sales
  <chr>  <chr>     <dbl>
1 North  product_A   100
2 North  product_B    80
3 North  product_C    60
4 South  product_A   120
5 South  product_B    90
6 South  product_C    70
7 East   product_A   150
8 East   product_B   110
9 East   product_C    80

The resulting sales_long data frame will have three columns: region, product, and sales, where each row represents the sales figure of a specific product in a particular region.

Example 3: Handling Irregular Data

Sometimes, data frames contain irregular structures, such as missing values or uneven numbers of columns. pivot_longer() can handle such scenarios gracefully. Consider a data frame called measurements with columns representing different measurement types and their respective values:

measurements <- data.frame(
  timestamp = c("2022-01-01", "2022-01-02", "2022-01-03"),
  temperature = c(25.3, 27.1, 24.8),
  humidity = c(65.2, NA, 68.5),
  pressure = c(1013, 1012, NA)
)

To reshape this data, we can use pivot_longer() and handle the missing values:

measurements_long <- pivot_longer(
  measurements, 
  cols = -timestamp, 
  names_to = "measurement", 
  values_to = "value", 
  values_drop_na = TRUE
  )

measurements_long
# A tibble: 7 × 3
  timestamp  measurement  value
  <chr>      <chr>        <dbl>
1 2022-01-01 temperature   25.3
2 2022-01-01 humidity      65.2
3 2022-01-01 pressure    1013  
4 2022-01-02 temperature   27.1
5 2022-01-02 pressure    1012  
6 2022-01-03 temperature   24.8
7 2022-01-03 humidity      68.5

The resulting measurements_long data frame will have three columns: timestamp, measurement, and value, where each row represents a specific measurement at a particular timestamp. The values_drop_na argument ensures that rows with missing values are dropped.

Conclusion

In this blog post, we explored the pivot_longer() function from the tidyr library, which allows us to reshape data from a wider format to a longer format. We covered the syntax and provided several examples to illustrate its usage. By mastering pivot_longer(), you’ll be equipped to tidy your data and unleash its true potential for analysis and visualization.