Taming the Data Jungle: Filtering data.tables and data.frames in R

code
rtip
operations
Author

Steven P. Sanderson II, MPH

Published

February 23, 2024

Introduction

Ah, data! The lifeblood of many an analysis, but sometimes it can feel like you’re lost in a tangled jungle. Thankfully, R offers powerful tools to navigate this data wilderness, and filtering is one of the most essential skills in your arsenal. Today, we’ll explore how to filter both data.tables and data.frames, making your data exploration a breeze!

Filtering data.tables: Precise and Powerful

data.tables, brought to you by the data.table package, are known for their speed and efficiency. Here’s how to filter them:

Examples

Example 1. Filtering by a single condition:

# Sample data.table
library(data.table)
mtcars_dt <- as.data.table(mtcars)

# Filter cars with MPG greater than 25
filtered_cars <- mtcars_dt[mpg > 25]
filtered_cars
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
2:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
3:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
4:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
5:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
6:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2

Explanation:

  • mtcars_dt[mpg > 25] selects rows where the mpg column is greater than 25.
  • The result, stored in filtered_cars, is a new data.table containing only those rows.

Example 2. Combining conditions:**

# Filter cars with 4 cylinders and horsepower over 150
filtered_cars <- mtcars_dt[(cyl == 4) & (hp > 150)]
filtered_cars
Empty data.table (0 rows and 11 cols): mpg,cyl,disp,hp,drat,wt...

Explanation:

  • (cyl == 4) & (hp > 150) combines two conditions using the & operator (AND).
  • Only rows meeting both conditions are included in the filtered data.table.

Example 3. Filtering by values in a list:

# Filter cars with carb in 1 or 2
filtered_cars <- mtcars_dt[carb %in% c(1, 2)]
filtered_cars
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
 1:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 2:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 3:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 4:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 5:  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 6:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
 7:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
 8:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
 9:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
10:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1
11:  15.5     8 318.0   150  2.76 3.520 16.87     0     0     3     2
12:  15.2     8 304.0   150  3.15 3.435 17.30     0     0     3     2
13:  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
14:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
15:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
16:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
17:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

Explanation:

  • %in% checks if a value belongs to a list.
  • Here, we filter for cars where the carb is either 1 or 2.

Filtering data.frames: Familiar and Flexible

data.frames are the workhorses of R. Here’s how to filter them:

Example 1. Filtering with logical operators:

# Filter irises with Sepal.Length less than 5 and Petal.Width greater than 2
filtered_iris <- iris[iris$Sepal.Length < 5 & iris$Petal.Width > 2,]
filtered_iris
[1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species     
<0 rows> (or 0-length row.names)

Explanation:

  • This approach is similar to data.tables, using logical operators (<, >, &) to define conditions.
  • The filtered data.frame is stored in filtered_iris.

Example 2. Subsetting with row indices:

# Filter the first 3 and last 2 rows
filtered_iris <- iris[1:3, ] # First 3 rows
filtered_iris
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
filtered_iris <- iris[nrow(iris) - 0:1, ] # Last 2 rows
filtered_iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
150          5.9         3.0          5.1         1.8 virginica
149          6.2         3.4          5.4         2.3 virginica

Explanation:

  • You can directly specify row indices within square brackets [].
  • This is useful for selecting specific rows based on their position.

Ready to Explore?

Now that you’re equipped with these filtering techniques, dive into your own data! Try practicing on different datasets and experiment with combining conditions. Remember, the more you practice, the more comfortable you’ll become navigating the data jungle.

Bonus Tip: Don’t forget to explore the dplyr package for even more powerful filtering options!