# Counting NA Values in Each Column: Comparing Methods in R

Categories: code, rtip, operations

Author: Steven P. Sanderson II, MPH

Published: May 7, 2024

# Introduction

Welcome back, R enthusiasts! Today, we’re going to explore a fundamental task in data analysis: counting the number of missing (NA) values in each column of a dataset. This might seem straightforward, but there are different ways to achieve this using different packages and methods in R.

Let’s dive right in and compare how to accomplish this task using base R, dplyr, and data.table. Each method has its own strengths and can cater to different preferences and data handling scenarios.

# Examples

## Using Base R

First up, let’s tackle this using base R functions. We’ll leverage the `colSums()` function along with `is.na()` to count NA values in each column of a dataframe.

```r
# Sample dataframe
df <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, 3, NA),
  C = c(1, NA, NA, 4)
)

# Count NA values in each column using base R
na_counts_base <- colSums(is.na(df))
print(na_counts_base)
```

```
A B C 
1 2 2 
```

In this code snippet, `is.na(df)` creates a logical matrix indicating NA positions in `df`. `colSums()` then sums up the TRUE values (which represent NA) across each column, giving us the count of NAs per column. Simple and effective!

## Using Base R (with lapply)

Another base R option is `lapply()`, which applies a function to each column of the dataframe (`df`) and returns the counts as a list.

```r
# Count NA values in each column using base R and lapply
na_counts_lapply <- lapply(df, function(x) sum(is.na(x)))

print(na_counts_lapply)
```

```
$A
[1] 1

$B
[1] 2

$C
[1] 2
```

In this snippet, `lapply(df, function(x) sum(is.na(x)))` applies the function `function(x) sum(is.na(x))` to each column of the dataframe (`df`), resulting in a list of NA counts per column.
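If you would rather have a named vector than a list, `sapply()` is a drop-in variant that simplifies the result. A minimal sketch (the variable name `na_counts_sapply` is my own, not from the original post):

```r
# Sample dataframe
df <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, 3, NA),
  C = c(1, NA, NA, 4)
)

# sapply() works like lapply() but simplifies the list
# result into a named integer vector
na_counts_sapply <- sapply(df, function(x) sum(is.na(x)))
print(na_counts_sapply)
#> A B C
#> 1 2 2
```

This gives the same shape of result as the `colSums()` approach above.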

## Using dplyr

Now, let’s switch gears and utilize the popular `dplyr` package to achieve the same task in a more streamlined manner.

```r
library(dplyr)

# Count NA values in each column using dplyr
na_counts_dplyr <- df %>%
  summarise_all(~ sum(is.na(.)))

print(na_counts_dplyr)
```

```
  A B C
1 1 2 2
```

Here, `summarise_all()` from `dplyr` applies `sum(is.na(.))` to each column (the `.` pronoun refers to the current column), returning a one-row dataframe of NA counts. This approach is clean and fits well into a tidyverse workflow.
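Worth noting: in current versions of dplyr (1.0.0 and later), `summarise_all()` is superseded by `summarise()` combined with `across()`. A sketch of the equivalent modern idiom (the variable name `na_counts_across` is mine):

```r
library(dplyr)

# Sample dataframe
df <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, 3, NA),
  C = c(1, NA, NA, 4)
)

# across(everything(), ...) applies the function to every column;
# .x refers to the current column inside the lambda
na_counts_across <- df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

print(na_counts_across)
#>   A B C
#> 1 1 2 2
```

Both spellings return the same one-row dataframe, so either works here; `across()` is simply the form the dplyr documentation now recommends.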

## Using data.table

Last but not least, let’s see how to accomplish this using `data.table`, a powerful package known for its efficiency with large datasets.

```r
library(data.table)

# Convert dataframe to data.table
dt <- as.data.table(df)

# Count NA values in each column using data.table
na_counts_data_table <- dt[, lapply(.SD, function(x) sum(is.na(x)))]

print(na_counts_data_table)
```

```
       A     B     C
   <int> <int> <int>
1:     1     2     2
```

In this snippet, `lapply(.SD, function(x) sum(is.na(x)))` applies the counting function to each column. `.SD` stands for the Subset of Data; with no `by` grouping it contains every column of the table, and `lapply()` iterates over those columns.
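Where `.SD` really pays off is grouped counting, which the base R and `colSums()` approaches don't handle as directly. A hypothetical sketch (the `grp` column and variable names are illustrative, not from the original example):

```r
library(data.table)

# Hypothetical grouped data: same idea, plus a grouping column
dt_grouped <- data.table(
  grp = c("x", "x", "y", "y"),
  A   = c(1, 2, NA, 4),
  B   = c(NA, 2, 3, NA)
)

# .SDcols restricts .SD to the value columns;
# by = grp counts NAs separately within each group
na_by_group <- dt_grouped[, lapply(.SD, function(x) sum(is.na(x))),
                          by = grp, .SDcols = c("A", "B")]

print(na_by_group)
```

Each row of the result holds the per-column NA counts for one group, which is handy when auditing missingness across subsets of a large table.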

# Which Method to Choose?

Now that we’ve explored three different methods to count NA values in each column, you might be wondering which one to use. The answer depends on your preference, the complexity of your dataset, and the packages you’re comfortable working with.

• Base R is straightforward and doesn’t require additional packages.
• dplyr is excellent for working within the tidyverse, especially if you’re already using other tidy tools.
• data.table shines with large datasets due to its efficiency and syntax.
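Whichever method you pick, a closely related trick is worth keeping in your back pocket: swapping `colSums()` for `colMeans()` turns counts into proportions of missing values, which is often the more useful number when columns differ in length across datasets. A quick sketch using the same sample dataframe:

```r
# Sample dataframe
df <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, 3, NA),
  C = c(1, NA, NA, 4)
)

# colMeans() averages the logical NA matrix,
# giving the fraction of missing values per column
na_props <- colMeans(is.na(df))
print(na_props)
#>    A    B    C 
#> 0.25 0.50 0.50 
```

Half of columns B and C are missing here, versus a quarter of column A, a distinction raw counts alone can obscure.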