Introducing check_duplicate_rows() from TidyDensity


Steven P. Sanderson II, MPH


May 1, 2024


Today, we’re diving into a useful new function from the TidyDensity R package: check_duplicate_rows(). This function is designed to efficiently identify duplicate rows within a data frame, providing a logical vector that flags each row as either a duplicate or unique. Let’s explore how this function works and see it in action with some illustrative examples.

Understanding check_duplicate_rows()

The check_duplicate_rows() function takes a single argument, .data, which should be a data frame. It then compares each row of the data frame to every other row to identify duplicates based on complete row matches.



Let’s start by demonstrating how this function operates with two scenarios: one where there are no duplicate rows, and another where there are duplicate rows with identical values in specific columns.

Example 1: No Duplicates

First, let’s create a data frame where all rows are unique. We’ll use the iris dataset for this example:

# Load required libraries

# Create a data frame (iris dataset)
data_no_duplicates <- iris

# Check for duplicate rows
duplicates <- check_duplicate_rows(data_no_duplicates)

# View the result

In this case, the duplicates vector will contain only FALSE values, indicating that no rows in iris are exact duplicates of each other.

Example 2: Duplicate Rows

Next, let’s create a scenario where some rows contain identical values in specific columns. We’ll manually construct a data frame for this purpose:

# Create a data frame with duplicate rows
data_with_duplicates <- data.frame(
  Name = c("John", "Alice", "John", "Bob", "Alice","David"),
  Age = c(25, 30, 25, 40, 30, 50),
  Score = c(85, 90, 85, 75, 90, 50)

# Check for duplicate rows
duplicates <- check_duplicate_rows(data_with_duplicates)

# View the result

In this example, the duplicates vector will indicate which rows are duplicates (TRUE for duplicates, FALSE for unique rows). You’ll notice that the last row is flagged as a duplicate because there is the same value for the Age and Score columns.


The check_duplicate_rows() function in the TidyDensity package is a handy tool for identifying duplicate rows within a data frame. It can be particularly useful for data cleaning and quality assurance tasks, ensuring that datasets are free from unintended duplicates that could skew analysis results.

If you work with data frames and want a straightforward way to detect duplicate rows efficiently, consider incorporating check_duplicate_rows() into your R workflow with TidyDensity. This function exemplifies the package’s commitment to providing practical, user-friendly tools for data manipulation and analysis.

That wraps up our overview of check_duplicate_rows(). We hope you find this function useful in your data analysis endeavors! If you have any questions or feedback, feel free to reach out in the comments below. Until next time, happy coding with R!