How to Remove Duplicate Rows in R: A Comprehensive Guide

Learn how to remove duplicate rows in R using base R, dplyr, and data.table methods. Comprehensive guide with practical examples and performance comparisons for R programmers.
Author

Steven P. Sanderson II, MPH

Published

January 28, 2025


Introduction

Dealing with duplicate rows is a common challenge in data analysis and cleaning. This comprehensive guide will show you how to effectively remove duplicate rows in R using multiple approaches, including base R, dplyr, and data.table methods.

Understanding Duplicate Rows

Duplicate rows are identical observations that appear multiple times in your dataset. They can arise from various sources, such as:

  • Data entry errors
  • Multiple data imports
  • System-generated duplicates
  • Merged datasets

Method 1: Base R Approach

Using unique()

df <- data.frame(
  id = c(1,1,2,2,3),
  value = c(10,10,20,30,40)
)
# Remove all duplicate rows
df_unique <- unique(df)
print(df_unique)
  id value
1  1    10
3  2    20
4  2    30
5  3    40
# Same result with duplicated(), naming the columns to compare
df_unique <- df[!duplicated(df[c("id","value")]), ]
print(df_unique)
  id value
1  1    10
3  2    20
4  2    30
5  3    40

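Base R offers two routes. unique() returns the data frame with repeated rows dropped, while duplicated() returns a logical vector that is TRUE for every row that repeats an earlier row and FALSE for first occurrences, so df[!duplicated(df), ] keeps the first copy of each row. Both are straightforward but may not be the most efficient choice for large datasets.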
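To make the logical vector concrete, the short sketch below prints what duplicated() returns for the sample data and shows the fromLast argument, which keeps the last occurrence instead of the first. The names dup_flags, n_dups, and df_keep_last are illustrative and not part of the original example.

```r
df <- data.frame(
  id = c(1, 1, 2, 2, 3),
  value = c(10, 10, 20, 30, 40)
)

# TRUE marks a row that repeats an earlier row; first occurrences are FALSE
dup_flags <- duplicated(df)
print(dup_flags)
# [1] FALSE  TRUE FALSE FALSE FALSE

# Count the duplicates before dropping them
n_dups <- sum(dup_flags)

# fromLast = TRUE scans from the bottom, so the last occurrence is kept
df_keep_last <- df[!duplicated(df, fromLast = TRUE), ]
print(df_keep_last)
```

Counting the flagged rows first is a cheap way to see how much data a deduplication step is about to remove.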

Method 2: dplyr Solution

Using distinct()

library(dplyr)

# Remove all duplicate rows
df_unique <- df %>% distinct()
print(df_unique)
  id value
1  1    10
2  2    20
3  2    30
4  3    40
# Remove duplicates based on specific columns
df_unique <- df %>% distinct(id, value, .keep_all = TRUE)
print(df_unique)
  id value
1  1    10
2  2    20
3  2    30
4  3    40

The dplyr package’s distinct() function is recommended for its clarity and speed. On larger datasets, dplyr can run roughly 30% faster than the base R approach because distinct() does its work in compiled code, though the exact gap depends on the size and column types of your data.
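Speed differences like this depend heavily on data size, column types, and package versions, so it is worth timing both approaches on your own data. The following is a rough sketch, assuming a synthetic 100,000-row data frame (big_df is a made-up name) and a single system.time() run per method; a dedicated benchmarking package would give more reliable numbers.

```r
set.seed(42)
n <- 1e5

# Synthetic data with many repeated (id, value) pairs
big_df <- data.frame(
  id = sample(1:1000, n, replace = TRUE),
  value = sample(1:50, n, replace = TRUE)
)

# Elapsed seconds for one run of the base R approach
base_time <- system.time(base_res <- unique(big_df))[["elapsed"]]
cat("base R unique():", base_time, "seconds\n")

# Time dplyr only if it is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_time <- system.time(dplyr_res <- dplyr::distinct(big_df))[["elapsed"]]
  cat("dplyr distinct():", dplyr_time, "seconds\n")

  # Both approaches should agree on the number of unique rows
  stopifnot(nrow(base_res) == nrow(dplyr_res))
}
```

A single run is noisy, so treat the printed timings as an indication rather than a measurement.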

Method 3: data.table Approach

library(data.table)

# Convert to data.table
dt <- as.data.table(df)
print(dt)
      id value
   <num> <num>
1:     1    10
2:     1    10
3:     2    20
4:     2    30
5:     3    40
# Remove duplicates
dt_unique <- unique(dt)
print(dt_unique)
      id value
   <num> <num>
1:     1    10
2:     2    20
3:     2    30
4:     3    40
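The data.table method for unique() also accepts a by argument, which limits the comparison to the named columns. The sketch below deduplicates on id alone; dt_by_id and dt_by_id_last are illustrative names.

```r
library(data.table)

dt <- data.table(
  id = c(1, 1, 2, 2, 3),
  value = c(10, 10, 20, 30, 40)
)

# Compare rows on id only; the first row for each id is kept
dt_by_id <- unique(dt, by = "id")
print(dt_by_id)

# fromLast = TRUE keeps the last row for each id instead
dt_by_id_last <- unique(dt, by = "id", fromLast = TRUE)
print(dt_by_id_last)
```

All columns are returned either way; by only controls which columns decide whether two rows count as duplicates.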

Working with Multiple Columns

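To remove duplicates based on a subset of columns, pass only those columns. The examples below deduplicate on id alone and keep the first row for each id: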

# Using dplyr
df %>% 
  distinct(id, .keep_all = TRUE)
  id value
1  1    10
2  2    20
3  3    40
# Using base R
df[!duplicated(df$id), ]
  id value
1  1    10
3  2    20
5  3    40
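Because both calls keep the first row they meet for each id, you can control which row survives by sorting first. As a sketch, assuming you want the row with the largest value for each id (df_sorted and df_max are illustrative names):

```r
df <- data.frame(
  id = c(1, 1, 2, 2, 3),
  value = c(10, 10, 20, 30, 40)
)

# Sort so the row to keep comes first within each id (largest value on top)
df_sorted <- df[order(df$id, -df$value), ]

# duplicated() then drops every later row for the same id
df_max <- df_sorted[!duplicated(df_sorted$id), ]
print(df_max)
```

For id 2 this keeps the row with value 30 rather than 20, which is what the unsorted examples above return.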

Best Practices

  1. Choose the right method:
    • For small datasets: Base R is sufficient
    • For large datasets: Use dplyr or data.table
    • For complex operations: Consider dplyr for readability
  2. Consider performance:
    • Group operations before removing duplicates
    • Index your data when using data.table
    • Monitor memory usage for large datasets

Your Turn!

Try this exercise:

# Create a sample dataset
df <- data.frame(
  id = c(1,1,2,2,3),
  value = c(10,10,20,30,40)
)

# Your task: Remove duplicates based on both id and value
# Write your solution below

Solution:

library(dplyr)
# Using dplyr
result <- df %>% distinct(id, value)
print(result)
  id value
1  1    10
2  2    20
3  2    30
4  3    40
# Using base R
result <- df[!duplicated(df[c("id", "value")]),]
print(result)
  id value
1  1    10
3  2    20
4  2    30
5  3    40

Quick Takeaways

  • Use distinct() from dplyr for most scenarios
  • Consider performance implications for large datasets
  • Always verify results after deduplication
  • Keep all columns with .keep_all = TRUE when needed
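One way to verify a deduplication step, sketched here in base R with illustrative names (rows_removed, id_counts, ids_repeated), is to compare row counts, confirm that no full-row duplicates remain, and check whether a key column still repeats:

```r
df <- data.frame(
  id = c(1, 1, 2, 2, 3),
  value = c(10, 10, 20, 30, 40)
)
df_unique <- unique(df)

# How many rows were dropped?
rows_removed <- nrow(df) - nrow(df_unique)
cat("Rows removed:", rows_removed, "\n")

# anyDuplicated() returns 0 when no duplicated rows remain
stopifnot(anyDuplicated(df_unique) == 0)

# Full-row deduplication does not make a key unique: list ids that still repeat
id_counts <- table(df_unique$id)
ids_repeated <- names(id_counts[id_counts > 1])
print(ids_repeated)
```

Here id 2 still appears twice because its two rows carry different values, which is a reminder to decide up front whether you are deduplicating whole rows or keys.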

FAQs

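  1. Q: Which method is fastest for large datasets? A: It depends on the data, so benchmark on your own dataset. As a rough guide, dplyr is cited in this guide as about 30% faster than base R on larger datasets, and data.table is also designed for speed on large tables.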

  2. Q: Can I remove duplicates based on specific columns? A: Yes, all methods (base R, dplyr, and data.table) support column-specific deduplication.

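  3. Q: Will removing duplicates affect my row order? A: The methods shown here keep the first occurrence of each duplicate and leave the remaining rows in their original order. Base R also keeps the original row names (1, 3, 4, 5 in the examples above), while dplyr renumbers the rows, so add an explicit row-number column if you need to trace rows back to the source.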

  4. Q: How do I keep only the first occurrence of duplicates? A: Use duplicated() with the ! operator in base R, or distinct() in dplyr with the columns that define a duplicate.

  5. Q: What happens to missing values (NA) during deduplication? A: duplicated(), unique(), and distinct() treat NA values as equal to each other, so two rows that match in every column, including their NAs, count as duplicates.
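The NA behavior is easy to check with a small base R sketch. The data frame df_na below is made up for illustration:

```r
# Two pairs of identical rows, one pair containing NA in the id column
df_na <- data.frame(
  id = c(1, 1, NA, NA),
  value = c(10, 10, 5, 5)
)

# The second NA row is flagged as a duplicate of the first
na_flags <- duplicated(df_na)
print(na_flags)
# [1] FALSE  TRUE FALSE  TRUE

# unique() therefore keeps a single row with NA in id
df_na_unique <- unique(df_na)
print(df_na_unique)
```

This differs from the == operator, where NA == NA evaluates to NA rather than TRUE.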

Conclusion

Removing duplicate rows is an essential skill for data cleaning in R. While there are multiple approaches available, the dplyr distinct() function offers the best balance of performance and readability for most use cases. Remember to consider your specific needs regarding performance, readability, and functionality when choosing a method.

Engage!

Share your experiences with these methods in the comments below! Have you found other efficient ways to handle duplicates in R? Let’s discuss!


Happy Coding! 🚀

