# Taming the Data Jungle: Filtering data.tables and data.frames in R

code
rtip
operations
Author

Steven P. Sanderson II, MPH

Published

February 23, 2024

# Introduction

Ah, data! The lifeblood of many an analysis, but sometimes it can feel like you’re lost in a tangled jungle. Thankfully, R offers powerful tools to navigate this data wilderness, and filtering is one of the most essential skills in your arsenal. Today, we’ll explore how to filter both data.tables and data.frames, making your data exploration a breeze!

# Filtering data.tables: Precise and Powerful

data.tables, brought to you by the `data.table` package, are known for their speed and efficiency. Here’s how to filter them:

# Examples

## Example 1. Filtering by a single condition:

``````# Sample data.table
library(data.table)
mtcars_dt <- as.data.table(mtcars)

# Filter cars with MPG greater than 25
filtered_cars <- mtcars_dt[mpg > 25]
filtered_cars``````
``````     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
2:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
3:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
4:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
5:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
6:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2``````

Explanation:

• `mtcars_dt[mpg > 25]` selects rows where the `mpg` column is greater than 25.
• The result, stored in `filtered_cars`, is a new data.table containing only those rows.

## Example 2. Combining conditions:**

``````# Filter cars with 4 cylinders and horsepower over 150
filtered_cars <- mtcars_dt[(cyl == 4) & (hp > 150)]
filtered_cars``````
``Empty data.table (0 rows and 11 cols): mpg,cyl,disp,hp,drat,wt...``

Explanation:

• `(cyl == 4) & (hp > 150)` combines two conditions using the `&` operator (AND).
• Only rows meeting both conditions are included in the filtered data.table.

## Example 3. Filtering by values in a list:

``````# Filter cars with carb in 1 or 2
filtered_cars <- mtcars_dt[carb %in% c(1, 2)]
filtered_cars``````
``````      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
2:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
3:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
4:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
5:  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
6:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
7:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
8:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
9:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
10:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1
11:  15.5     8 318.0   150  2.76 3.520 16.87     0     0     3     2
12:  15.2     8 304.0   150  3.15 3.435 17.30     0     0     3     2
13:  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
14:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
15:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
16:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
17:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2``````

Explanation:

• `%in%` checks if a value belongs to a list.
• Here, we filter for cars where the `carb` is either 1 or 2.

# Filtering data.frames: Familiar and Flexible

data.frames are the workhorses of R. Here’s how to filter them:

## Example 1. Filtering with logical operators:

``````# Filter irises with Sepal.Length less than 5 and Petal.Width greater than 2
filtered_iris <- iris[iris\$Sepal.Length < 5 & iris\$Petal.Width > 2,]
filtered_iris``````
``````[1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species
<0 rows> (or 0-length row.names)``````

Explanation:

• This approach is similar to data.tables, using logical operators (`<`, `>`, `&`) to define conditions.
• The filtered data.frame is stored in `filtered_iris`.

## Example 2. Subsetting with row indices:

``````# Filter the first 3 and last 2 rows
filtered_iris <- iris[1:3, ] # First 3 rows
filtered_iris``````
``````  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa``````
``````filtered_iris <- iris[nrow(iris) - 0:1, ] # Last 2 rows
filtered_iris``````
``````    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
150          5.9         3.0          5.1         1.8 virginica
149          6.2         3.4          5.4         2.3 virginica``````

Explanation:

• You can directly specify row indices within square brackets `[]`.
• This is useful for selecting specific rows based on their position.

Bonus Tip: Don’t forget to explore the `dplyr` package for even more powerful filtering options!