Exploring strsplit() with Multiple Delimiters in R


Steven P. Sanderson II, MPH


April 26, 2024


In data preprocessing and text manipulation tasks, the strsplit() function in R is incredibly useful for splitting strings based on specific delimiters. However, what if you need to split a string using multiple delimiters? This is where strsplit() can really shine by allowing you to specify a regular expression that defines these delimiters. In this blog post, we’ll dive into how you can use strsplit() effectively with multiple delimiters to parse strings in your data.

Understanding strsplit()

The strsplit() function in R is used to split a character vector (or a string) into substrings based on a specified pattern. The general syntax of strsplit() is:

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
  • x: The character vector or string to be split.
  • split: The delimiter or regular expression to use for splitting.
  • fixed: If TRUE, split is treated as a fixed string rather than a regular expression.
  • perl: If TRUE, split is treated as a Perl-style regular expression.
  • useBytes: If TRUE, the matching is byte-based rather than character-based.

Splitting with Multiple Delimiters

To split a string using multiple delimiters, we can leverage the power of regular expressions within strsplit(). Regular expressions allow us to define complex patterns that can match various types of strings.

Let’s say we have the following string that contains different types of delimiters: space, comma, and hyphen:

text <- "apple,orange banana -grape pineapple"

We want to split this string into individual words based on the delimiters ,, , and -. Here’s how we can achieve this using strsplit():

result <- strsplit(text, "[,\\s-]+")
[1] "apple"           "orange banana "  "grape pineapple"

In this example: - [ and ] define a character class. - ,, \\s, and - inside the character class specify the delimiters we want to use for splitting. - + after the character class means “one or more occurrences”.

Examples with Different Delimiters

Let’s explore a few more examples to understand how strsplit() handles different scenarios:

Example 1: Splitting with Numbers as Delimiters

text <- "Hello123world456R789users"
result <- strsplit(text, "[0-9]+")

In this case, we use [0-9]+ to split the string wherever there are one or more consecutive digits. The result will be:

[1] "Hello" "world" "R"     "users"

Example 2: Splitting URLs

url <- "https://www.example.com/path/to/page.html"
result <- strsplit(url, "[:/\\.]")

Here, we split the URL based on :, /, and . characters. The result will be:

 [1] "https"   ""        ""        "www"     "example" "com"     "path"   
 [8] "to"      "page"    "html"   

Your Turn to Experiment

The best way to truly understand and harness the power of strsplit() with multiple delimiters is to experiment with different strings and patterns. Try splitting strings using various combinations of characters and observe how strsplit() behaves.

By mastering strsplit() and regular expressions, you can efficiently preprocess and manipulate textual data in R, making your data analysis tasks more effective and enjoyable.

So, why not give it a try? Experiment with strsplit() and multiple delimiters on your own datasets to see how this versatile function can streamline your data cleaning workflows. If you want a really good cheat sheet of regular expressions then check out this one from the stringr package from Posit.

Happy coding!