Introduction

Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.

Let’s dive into a scenario where we have data that needs cleaning and extracting numerical values from strings. Our data, stored in a dataframe named df, consists of four columns (x1, x2, x3, x4) with strings containing numerical values along with percentage values enclosed in parentheses. Our goal is to extract these numerical values and compute a total for each row.

Loading Libraries

Before we begin, we need to load the necessary libraries. We’ll be using the tidyverse package for data manipulation, along with glue and unglue for string manipulation.

# Library Loading
pacman::p_load(tidyverse, glue, unglue)

Exploring the Data

Let’s take a sneak peek at our data using the head() function to understand its structure.

df <- tibble(
  x1 = rep("Unit A", 11),
  x2 = c(glue("{11:20} ({1:10}%)"),  glue("{251} ({13}%)")),
  x3 = c(glue("{21:30} ({11:20}%)"), glue("{252} ({14}%)")),
  x4 = c(glue("{31:40} ({21:30}%)"), glue("{253} ({15}%)"))
)

head(df, 3)

# A tibble: 3 × 4
  x1     x2      x3       x4      
  <chr>  <chr>   <chr>    <chr>   
1 Unit A 11 (1%) 21 (11%) 31 (21%)
2 Unit A 12 (2%) 22 (12%) 32 (22%)
3 Unit A 13 (3%) 23 (13%) 33 (23%)

This command displays the first three rows of our dataframe df, giving us an idea of how our data looks like.

Creating a Regex Function

Now, we’ll define a custom function named reg_val_fns to extract numerical values from strings using regular expressions. This function takes two parameters: .col_data (column data) and .pattern (regex pattern). If no pattern is provided, it defaults to extracting any sequence of digits followed by non-word characters or the end of the string.

# Make regex function
reg_val_fns <- function(.col_data, .pattern = NULL){
  ptrn <- .pattern
  if(is.null(ptrn)){
    ptrn <- "\\d+(?=\\W|$)"
  }
  
  reged_val <- .col_data |>
    str_extract(ptrn) |>
    as.numeric()

  return(reged_val)
}

Applying the Regex Function

With our regex function defined, we apply it across desired columns using the mutate(across()) function from the dplyr package. This extracts numerical values from strings in each column, converting them into numeric format. Additionally, we compute the total value for each row using rowSums().

# Apply the function across the desired columns
df |>
  mutate(across(-x1, reg_val_fns)) |>
  mutate(total_val = rowSums(across(-x1)))

# A tibble: 11 × 5
   x1        x2    x3    x4 total_val
   <chr>  <dbl> <dbl> <dbl>     <dbl>
 1 Unit A    11    21    31        63
 2 Unit A    12    22    32        66
 3 Unit A    13    23    33        69
 4 Unit A    14    24    34        72
 5 Unit A    15    25    35        75
 6 Unit A    16    26    36        78
 7 Unit A    17    27    37        81
 8 Unit A    18    28    38        84
 9 Unit A    19    29    39        87
10 Unit A    20    30    40        90
11 Unit A   251   252   253       756

Alternative Approach: Using unglue

An alternative method to extract values from strings is using the unglue package. Here, we apply the unglue_data() function across columns (excluding x1) to extract values and percentages separately, then unnest the resulting dataframe and compute the total value for each row.

# Use unglue
df |>
  mutate(across(-x1, \(x) unglue_data(x, "{val} ({pct}%)"))) |> 
  unnest(cols = everything(), names_sep = "_") |>
  mutate(across(.cols = contains("val"), \(x) as.numeric(x))) |>
  mutate(total_val = rowSums(across(where(is.numeric))))

# A tibble: 11 × 8
   x1     x2_val x2_pct x3_val x3_pct x4_val x4_pct total_val
   <chr>   <dbl> <chr>   <dbl> <chr>   <dbl> <chr>      <dbl>
 1 Unit A     11 1          21 11         31 21            63
 2 Unit A     12 2          22 12         32 22            66
 3 Unit A     13 3          23 13         33 23            69
 4 Unit A     14 4          24 14         34 24            72
 5 Unit A     15 5          25 15         35 25            75
 6 Unit A     16 6          26 16         36 26            78
 7 Unit A     17 7          27 17         37 27            81
 8 Unit A     18 8          28 18         38 28            84
 9 Unit A     19 9          29 19         39 29            87
10 Unit A     20 10         30 20         40 30            90
11 Unit A    251 13        252 14        253 15           756

Conclusion

In this tutorial, we’ve explored how to leverage the power of regular expressions in R to extract numerical values from strings within a dataframe. By defining custom regex functions and using packages like dplyr and unglue, we can efficiently clean and manipulate text data for further analysis.

I encourage you to try out these techniques on your own datasets and explore the endless possibilities of regex in R. Happy coding!