Introduction

For every R programmer, data wrangling is a daily task, and few things are more frustrating than incomplete, “empty” rows silently contaminating your analysis. While “empty” can mean a row entirely composed of missing values (NAs), it more often refers to a row containing at least one missing value.

Efficiently identifying and purging these incomplete records is crucial for maintaining data integrity. Fortunately, R offers powerful, flexible tools in both Base R and the popular Tidyverse ecosystem to handle this.

In this guide, we’ll cover three core methods: two rooted in Base R and one from the Tidyverse’s tidyr package. We’ll then benchmark the two approaches that drop incomplete rows, complete.cases() and drop_na(), using the rbenchmark package to compare their speed.

The Base R Standard: Using `complete.cases()`

The most common way to remove rows containing any NA is the built-in Base R function complete.cases(). This method is the workhorse of R data cleaning: it needs no extra packages and is simple to read.

Deep Dive into `complete.cases()`

The complete.cases() function returns a logical vector (TRUE or FALSE) for each row in a data frame. TRUE indicates the row contains no missing values, and FALSE means it contains at least one NA.

When used inside the subsetting brackets [ ], it acts as a powerful filter, keeping only the rows where the result is TRUE.

Example Code:

# 1. Setup Sample Data
df_test <- data.frame(
    x = c(10, 20, NA, 40, NA),
    y = c(NA, 50, NA, 80, 90),
    z = c(6, 7, NA, 9, 10)
)
print("Original Data Frame:")

[1] "Original Data Frame:"

print(df_test)

   x  y  z
1 10 NA  6
2 20 50  7
3 NA NA NA
4 40 80  9
5 NA 90 10

# 2. Method 1: Remove rows with NA in at least one column
# The comma at the end applies the filter to the rows (before the comma is the row index)
df_cleaned_cc <- df_test[complete.cases(df_test), ]
print("Cleaned using complete.cases():")

[1] "Cleaned using complete.cases():"

print(df_cleaned_cc)

   x  y z
2 20 50 7
4 40 80 9

In this example, complete.cases(df_test) returns FALSE for any row that is not fully populated. That removes three rows: row 1 (only y is missing), row 3 (every column is missing), and row 5 (only x is missing). Only rows 2 and 4 survive.
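To see exactly which rows are flagged, it can help to inspect the logical vector on its own before subsetting. This is a minimal sketch that rebuilds the same sample data so it runs standalone; the name keep_rows is illustrative only.

```r
# Rebuild the sample data so this snippet runs on its own
df_test <- data.frame(
    x = c(10, 20, NA, 40, NA),
    y = c(NA, 50, NA, 80, 90),
    z = c(6, 7, NA, 9, 10)
)

# One TRUE/FALSE per row: TRUE means the row has no NA
keep_rows <- complete.cases(df_test)
print(keep_rows)
# [1] FALSE  TRUE FALSE  TRUE FALSE

# Row numbers that would be dropped
print(which(!keep_rows))
# [1] 1 3 5
```

Storing the vector first also makes it easy to count how many rows you are about to lose, for example with sum(!keep_rows).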

Addressing Truly Empty Rows with `rowSums()`

While complete.cases() is great for removing incomplete records, sometimes you only want to remove rows that are entirely empty (i.e., every cell in that row is NA). This requires a slightly different approach using rowSums() and is.na().

This method works by:

is.na(df): Converts the data frame into a matrix of logical values (TRUE where there’s an NA).
rowSums(...): Sums the TRUE values (which are treated as 1s) across each row, counting the total number of NAs per row.
!= ncol(df): Filters the rows where the sum of NAs is not equal to the total number of columns. If the row sum of NAs equals the number of columns, the row is 100% empty and is removed.

Example Code:

# Setup data frame where row 3 is fully NA
df_truly_empty <- data.frame(
    x = c(10, NA, NA, 40),
    y = c(50, 60, NA, 80),
    z = c(1, 2, NA, 4)
)
print("Original Data Frame:")

[1] "Original Data Frame:"

print(df_truly_empty)

   x  y  z
1 10 50  1
2 NA 60  2
3 NA NA NA
4 40 80  4

# Method 2: Remove rows where all columns are NA
df_cleaned_rs <- df_truly_empty[rowSums(is.na(df_truly_empty)) != ncol(df_truly_empty), ]
print("Cleaned using rowSums():")

[1] "Cleaned using rowSums():"

print(df_cleaned_rs)

   x  y z
1 10 50 1
2 NA 60 2
4 40 80 4

Notice how row 3 is removed, but row 2 (which had only one NA) is retained.
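The three steps described above can be pulled apart to show the intermediate values. This is a small sketch on the same data; the names na_matrix, na_per_row, and keep_rows are illustrative and not part of any package.

```r
# Same data as above: row 3 is entirely NA, row 2 has a single NA
df_truly_empty <- data.frame(
    x = c(10, NA, NA, 40),
    y = c(50, 60, NA, 80),
    z = c(1, 2, NA, 4)
)

# Step 1: logical matrix, TRUE wherever a cell is NA
na_matrix <- is.na(df_truly_empty)

# Step 2: number of NAs in each row
na_per_row <- rowSums(na_matrix)
print(na_per_row)
# [1] 0 1 3 0

# Step 3: keep rows whose NA count is below the number of columns
keep_rows <- na_per_row != ncol(df_truly_empty)
print(keep_rows)
# [1]  TRUE  TRUE FALSE  TRUE
```

Only row 3 reaches a count equal to ncol(df_truly_empty), so it is the only row filtered out.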

The Tidyverse Approach: `tidyr::drop_na()`

For those who prefer the readable, pipe-friendly syntax of the Tidyverse, the tidyr package offers the concise drop_na() function. This method achieves the same result as complete.cases()—removing all rows with at least one missing value.

The primary advantage here is readability; the intent is immediately clear, especially when chaining multiple data operations.

Example Code:

# Ensure tidyr is loaded (drop_na() lives in tidyr, not dplyr)
library(tidyr)

# Method 3: Remove rows with NA using tidyr
df_cleaned_tidyr <- df_test |>
    drop_na()
print("Cleaned using tidyr::drop_na():")

[1] "Cleaned using tidyr::drop_na():"

print(df_cleaned_tidyr)

   x  y z
1 20 50 7
2 40 80 9

Performance Matters: Benchmarking Removal Methods

As R programmers, we care about more than just syntax; we care about efficiency. tidyr is often favored for readability and Base R is often assumed to be faster, so it is worth measuring rather than guessing.

We use the rbenchmark package to compare the two common methods for removing incomplete rows: Base R complete.cases() and the tidyr::drop_na() function.

library(rbenchmark)

# Create a large test data frame (10,000 rows)
set.seed(42)
big_df <- as.data.frame(
    matrix(
        sample(c(1:100, NA), 10000 * 5, replace = TRUE, prob = c(rep(0.95/100, 100), 0.05)),
        ncol = 5
    )
)

# Benchmark the two common methods
benchmark_results <- benchmark(
  BaseR_Complete = big_df[complete.cases(big_df), ],
  tidyr_DropNa = big_df |> drop_na(),
  replications = 1000,
  columns = c("test", "replications", "elapsed", "relative")
)

print(benchmark_results[order(benchmark_results$relative), ])

            test replications elapsed relative
2   tidyr_DropNa         1000    0.44    1.000
1 BaseR_Complete         1000    1.66    3.773

In this run, tidyr::drop_na() finished 1,000 replications in 0.44 seconds against 1.66 seconds for the Base R complete.cases() subset, making it roughly 3.8 times faster on this 10,000-row data frame. Exact timings will vary with your machine, R version, and package versions, so treat the ratio as indicative rather than universal. For most day-to-day tasks the absolute difference (about a millisecond per call here) is negligible, but for massive datasets or code that runs millions of times it can add up.
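If you want a rough timing check without installing any packages, a sketch along these lines uses only Base R. It swaps in system.time() instead of rbenchmark and compares complete.cases() with na.omit(); the helper time_it() is hypothetical, written just for this example, and no timings are shown because they depend on your machine.

```r
# Rebuild a 10,000-row test data frame with roughly 5% NAs per cell
set.seed(42)
big_df <- as.data.frame(
    matrix(
        sample(c(1:100, NA), 10000 * 5, replace = TRUE,
               prob = c(rep(0.95 / 100, 100), 0.05)),
        ncol = 5
    )
)

# Hypothetical helper: total elapsed seconds for n calls of a function
time_it <- function(f, n = 100) {
    system.time(for (i in seq_len(n)) f())[["elapsed"]]
}

t_cc   <- time_it(function() big_df[complete.cases(big_df), ])
t_omit <- time_it(function() na.omit(big_df))

# Timings vary by machine and R version
print(c(complete_cases = t_cc, na_omit = t_omit))

# Sanity check: both approaches should keep exactly the same rows
same_rows <- identical(
    rownames(big_df[complete.cases(big_df), ]),
    rownames(na.omit(big_df))
)
print(same_rows)
```

Checking that the methods agree before comparing their speed guards against benchmarking two operations that are not actually equivalent.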

Your Turn!

Imagine you have a data frame of customer survey responses. You want to remove only those rows where both the Satisfaction and Usage columns are missing, allowing rows with NA in only one of them to remain.

Your Task: Write the Base R code to remove rows where Satisfaction and Usage are both NA from the data frame survey_df.

survey_df <- data.frame(
    Customer = 1:5,
    Satisfaction = c(5, NA, 3, NA, 4),
    Usage = c(10, 5, NA, NA, 8)
)

Solution

The key is to use is.na() on the specific columns and combine the logical vectors with the AND operator (&):

# Logical vector where both are NA
is_double_na <- is.na(survey_df$Satisfaction) & is.na(survey_df$Usage)

# Filter the data frame to keep rows where 'is_double_na' is FALSE
# The '!' negates the logical vector, keeping rows that are NOT double NA
cleaned_survey_df <- survey_df[!is_double_na, ]

print(cleaned_survey_df)

  Customer Satisfaction Usage
1        1            5    10
2        2           NA     5
3        3            3    NA
5        5            4     8
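As an alternative, the rowSums() idea from earlier can be restricted to just the columns of interest, which scales more easily if you later need to check three or more columns. This is a sketch under that assumption; check_cols, na_count, and cleaned_alt are illustrative names.

```r
survey_df <- data.frame(
    Customer = 1:5,
    Satisfaction = c(5, NA, 3, NA, 4),
    Usage = c(10, 5, NA, NA, 8)
)

# Columns that must not ALL be missing
check_cols <- c("Satisfaction", "Usage")

# Count NAs across only those columns
na_count <- rowSums(is.na(survey_df[, check_cols]))

# Keep rows where at least one of the checked columns has a value
cleaned_alt <- survey_df[na_count < length(check_cols), ]

print(cleaned_alt$Customer)
# [1] 1 2 3 5
```

Customer 4 is the only row dropped, matching the result of the & solution.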

Key Takeaways

Most Readable: df %>% drop_na() (Tidyverse) offers the clearest, most readable syntax for Tidyverse users.
Targeted Cleaning: You can use complete.cases(df[ , c("col1", "col2")]) to check for completeness in only a subset of columns.
Truly Empty Rows: Use the rowSums(is.na(df)) != ncol(df) method to target and remove only rows that are 100% missing.
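The targeted-cleaning takeaway can be sketched on a small made-up data frame. The column names id, col1, col2, and notes are placeholders for illustration.

```r
df <- data.frame(
    id = 1:4,
    col1 = c(1, NA, 3, 4),
    col2 = c("a", "b", NA, "d"),
    notes = c(NA, "x", "y", NA)
)

# Only col1 and col2 decide whether a row is kept; NAs in notes are ignored
df_subset_clean <- df[complete.cases(df[, c("col1", "col2")]), ]

print(df_subset_clean$id)
# [1] 1 4
```

Rows 1 and 4 are kept even though their notes value is NA, because notes is not among the checked columns.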

Frequently Asked Questions (FAQs)

What is the difference between complete.cases() and na.omit()?
- complete.cases() returns a logical vector (TRUE/FALSE) that you can use for flexible subsetting. na.omit() is a high-level function that directly returns the data frame with all incomplete rows removed. For simple cleaning, the results are functionally the same.
Can I use drop_na() on specific columns?
- Yes. You can pass specific column names to the function: df %>% drop_na(column_a, column_b). This removes rows only if they have an NA in the specified columns.
Why do my empty rows sometimes show up as "" (empty strings) instead of NA?
- This is common when reading messy CSVs. If a cell is truly blank, R might import it as "". You must first convert these empty strings to the official NA missing value before using the functions discussed: df[df == ""] <- NA.
Is one method always better than the others?
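- No. In the benchmark above, tidyr::drop_na() was faster than complete.cases(), but results can differ with data size, column types, and package versions, and Base R has the advantage of needing no extra dependencies. For clear, pipe-friendly code in a Tidyverse project, tidyr::drop_na() is preferred. Choose the method that best fits your coding environment and performance needs.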
Does complete.cases() consider rows with NaN?
- Yes. For numeric columns, complete.cases() treats NaN (Not a Number) as missing, just as is.na(NaN) returns TRUE, so a row containing NaN is dropped along with rows containing NA. Infinite values (Inf, -Inf) are not treated as missing; if you want to drop those too, convert them to NA first or filter with is.finite().
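Two of the answers above, empty strings and the na.omit() comparison, are easy to check in a console. This is a minimal Base R sketch on made-up data; df_faq and its columns are illustrative.

```r
df_faq <- data.frame(
    id = 1:4,
    name = c("Ann", "", "Cy", "Di"),
    score = c(90, 85, NA, 70)
)

# An empty string is not NA, so row 2 still counts as complete
print(complete.cases(df_faq))
# [1]  TRUE  TRUE FALSE  TRUE

# Convert blanks to NA, then check again
df_faq[df_faq == ""] <- NA
print(complete.cases(df_faq))
# [1]  TRUE FALSE FALSE  TRUE

# na.omit() and complete.cases() subsetting keep the same rows
via_omit <- na.omit(df_faq)
via_cc   <- df_faq[complete.cases(df_faq), ]
print(identical(via_omit$id, via_cc$id))
# [1] TRUE
```

One difference worth knowing: na.omit() attaches an na.action attribute recording which rows were dropped, so the two results hold the same data but are not byte-for-byte identical objects.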

Conclusion: Choosing Your Cleaning Tool

Removing empty or incomplete rows is a foundational skill in R. Whether you prefer the dependency-free simplicity of complete.cases() or the pipe-friendly clarity of tidyr::drop_na(), R provides the right tool for your data cleaning toolkit.

Choosing between them ultimately boils down to balancing performance benchmarks with code readability. Start cleaning your data frames today for more accurate, robust statistical models!

What’s your go-to method? Share your preferred data cleaning function in the comments below!


Happy Coding! 🚀
