# Mastering Quantile Normalization in R: A Step-by-Step Guide

Author: Steven P. Sanderson II, MPH

Published: March 28, 2024

# Introduction

Quantile normalization is a crucial technique in data preprocessing, especially in fields like genomics and bioinformatics. It ensures that the distributions of different samples are aligned, making them directly comparable. In this tutorial, we’ll walk through the process step by step, demystifying the syntax and empowering you to apply this technique confidently in your projects.

# Understanding Quantile Normalization

Before we dive into the code, let’s understand the concept behind quantile normalization. At its core, quantile normalization aims to equalize the distributions of multiple datasets by aligning their quantiles. This ensures that each dataset has the same distribution of values, making meaningful comparisons possible.
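To make the idea concrete before the full example, here is a minimal hand-sized sketch. The matrix `toy` and its values are made up for illustration, and the example assumes there are no tied values within a column.

```r
# Toy data (made up): 4 observations in rows, 3 samples in columns
toy <- cbind(
  a = c(3, 1, 4, 2),
  b = c(10, 40, 20, 30),
  c = c(16, 8, 4, 12)
)

# 1. Sort each column on its own
sorted_toy <- apply(toy, 2, sort)

# 2. Average across each row of the sorted matrix: 5 10 15 20
reference <- rowMeans(sorted_toy)

# 3. Replace every value with the reference value of the same rank
toy_ranks <- apply(toy, 2, rank)
toy_normalized <- apply(toy_ranks, 2, function(r) reference[r])

toy_normalized
#       a  b  c
# [1,] 15  5 20
# [2,]  5 20 10
# [3,] 20 10  5
# [4,] 10 15 15
```

Every column now contains the same four values (5, 10, 15, 20), but each column keeps its own ordering: the largest value in `a` was in row 3, and row 3 still holds the largest value after normalization.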

# Example

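## Step 1: Create Some Data

First things first, you’ll need some data to work with. For this tutorial, we’ll simulate a data frame called `df` with three samples of 100 values each, drawn from normal distributions with different means and standard deviations.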

```r
set.seed(42)  # For reproducibility
df <- data.frame(
  sample1 = rnorm(100, mean = 5, sd = 2),
  sample2 = rnorm(100, mean = 10, sd = 1),
  sample3 = rnorm(100)
)

head(df)
```

```
   sample1   sample2    sample3
1 7.741917 11.200965 -2.0009292
2 3.870604 11.044751  0.3337772
3 5.726257  8.996791  1.1713251
4 6.265725 11.848482  2.0595392
5 5.808537  9.333227 -1.3768616
6 4.787751 10.105514 -1.1508556
```

```r
# Overlay the three raw samples on one set of axes
hist(df$sample1, col = 'red', xlim = c(min(df), max(df)),
     main = 'Distributions Before Normalization')
hist(df$sample2, col = 'blue', add = TRUE)
hist(df$sample3, col = 'green', add = TRUE)
legend('topright',
       c('Sample 1', 'Sample 2', 'Sample 3'),
       fill = c('red', 'blue', 'green'))
```
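As a numeric companion to the histograms, a short sketch like the following compares the quartiles of the raw samples. It rebuilds `df` with the same seed so it can run on its own; the name `raw_quartiles` is just an illustrative choice.

```r
set.seed(42)  # same seed as above, so df matches the tutorial data
df <- data.frame(
  sample1 = rnorm(100, mean = 5, sd = 2),
  sample2 = rnorm(100, mean = 10, sd = 1),
  sample3 = rnorm(100)
)

# Quartiles of each raw sample, one column per sample
raw_quartiles <- sapply(df, quantile, probs = c(0.25, 0.5, 0.75))
round(raw_quartiles, 2)
```

Before normalization the three columns should differ clearly, with each median sitting near the mean used to simulate that sample.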

## Step 2: Perform Quantile Normalization

Now it’s time to perform quantile normalization. Base R has no built-in function for this (the `quantile()` function only computes quantiles; it does not normalize anything), so we define a small helper, `qn()`, from base R building blocks. The function below is adapted from: https://lifewithdata.com/2023/09/02/how-to-perform-quantile-normalization-in-r/

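```r
# Perform quantile normalization
qn <- function(.data){
  # Sort each column independently, smallest to largest
  data_sort <- apply(.data, 2, sort)
  # Average the k-th smallest values across columns: the shared reference distribution
  row_means <- rowMeans(data_sort)
  # Copy the reference distribution into every column. matrix() fills
  # column-wise by default; byrow = TRUE would scramble the means here.
  data_sort <- matrix(row_means,
                      nrow = nrow(data_sort),
                      ncol = ncol(data_sort)
  )
  # Rank of each original value within its column. rank(), not order(),
  # gives the position each value takes in the sorted column.
  index_rank <- apply(.data, 2, rank, ties.method = "first")
  normalized_data <- matrix(nrow = nrow(.data), ncol = ncol(.data))
  for(i in 1:ncol(.data)){
    # Replace each value with the reference value of the same rank
    normalized_data[,i] <- data_sort[index_rank[,i], i]
  }
  return(normalized_data)
}

normalized_data <- qn(df)
```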

Let’s break down this function piece by piece:

1. Function Definition:

```r
qn <- function(.data){
  # ... function body here ...
}
```

This defines a function named `qn` that takes a single argument, `.data`: the data frame (or numeric matrix) you want to normalize, with one sample per column.

2. Sorting Each Column:

``data_sort <- apply(.data, 2, sort)``

This line sorts each column of `.data` independently, from smallest to largest, much like alphabetizing each column of a table separately. Because `apply()` returns a matrix, `data_sort` is a matrix in which row k holds the k-th smallest value of every column; the values in a row no longer come from the same original observation.

3. Calculating Row Means:

``row_means <- rowMeans(data_sort)``

This line averages each row of the sorted matrix (`data_sort`). Since row k holds the k-th smallest value from every column, `row_means[k]` is the mean of those k-th smallest values. Taken together, the row means form the shared reference distribution that every column will be mapped onto.

4. Replicating Row Means into a Matrix:

```r
data_sort <- matrix(row_means,
                    nrow = nrow(data_sort),
                    ncol = ncol(data_sort)
)
```

This part is a bit trickier. It overwrites `data_sort` with a new matrix of the same dimensions as the sorted data. `matrix()` fills column by column by default, and `row_means` has exactly one value per row, so the vector is recycled down each column: every column becomes a copy of `row_means`, and every entry in row k equals the k-th row mean. Note that `byrow = TRUE` should not be used here, because it would spread consecutive means across each row instead of repeating one mean per row.
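To see how `matrix()` recycles a short vector, here is a small sketch comparing the two fill directions. The values in `row_means` are made up, and `by_column` and `by_row` are illustrative names.

```r
row_means <- c(5, 10, 15, 20)  # made-up reference values, one per row

# Default fill (byrow = FALSE): the vector is repeated down each column
by_column <- matrix(row_means, nrow = 4, ncol = 3)

# byrow = TRUE: the vector is laid out along the rows and wraps around
by_row <- matrix(row_means, nrow = 4, ncol = 3, byrow = TRUE)

by_column
#      [,1] [,2] [,3]
# [1,]    5    5    5
# [2,]   10   10   10
# [3,]   15   15   15
# [4,]   20   20   20

by_row
#      [,1] [,2] [,3]
# [1,]    5   10   15
# [2,]   20    5   10
# [3,]   15   20    5
# [4,]   10   15   20
```

Only the column-wise fill puts the k-th mean in every entry of row k. With `byrow = TRUE`, each column still contains all four values, so summary statistics look fine, but the values are no longer in ascending order down each column.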

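5. Ranking Each Value’s Position:

``index_rank <- apply(.data, 2, rank, ties.method = "first")``

This line assigns a rank to each value within its column of the original data (`.data`). Imagine a race where first place gets rank 1, second place gets rank 2, and so on: here the smallest value in a column gets rank 1, the next smallest gets rank 2, and so forth. In other words, the rank is the row each value occupied after the sorting in step 2. Note that `rank()` is needed here rather than `order()`: `order()` answers a different question (which position holds the smallest value, the second smallest, and so on). Setting `ties.method = "first"` keeps the ranks as whole numbers when values are tied, so they can be used as row indices. The result is stored in `index_rank`.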
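The difference between `rank()` and `order()` is easy to miss, so here is a minimal sketch on a made-up vector. `reference` is a stand-in for the sorted row means, not a value from the tutorial data.

```r
x <- c(30, 10, 40, 20)              # made-up values
reference <- c(100, 200, 300, 400)  # stand-in for sorted row means

order(x)  # 2 4 1 3: where the smallest, 2nd smallest, ... values sit
rank(x)   # 3 1 4 2: the rank of each value, in its original position

via_rank  <- reference[rank(x)]   # 300 100 400 200
via_order <- reference[order(x)]  # 200 400 100 300
```

Indexing by rank keeps the ordering of `x` (the largest input, 40, maps to the largest reference value, 400). Indexing by `order()` does not: 10, the smallest input, ends up with 400.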

6. Preallocating the Output Matrix:

``normalized_data <- matrix(nrow = nrow(.data), ncol = ncol(.data))``

This line creates a matrix (`normalized_data`) with the same dimensions as the original data, filled with `NA` for now. The loop in the next step fills it in, one column at a time.

7. Looping Through Columns and Assigning Ranked Values:

```r
for(i in 1:ncol(.data)){
  normalized_data[,i] <- data_sort[index_rank[,i], i]
}
```

This is the core of the normalization process. For each column `i` of the original data (`.data`), `index_rank[, i]` holds the rank of every original value, and `data_sort[, i]` holds the row means in ascending order. Indexing the row means by rank replaces each original value with the mean at the same rank: the smallest value in the column gets the smallest row mean, the second smallest gets the second smallest row mean, and so on. Each value keeps its place in its column’s ordering, but every column ends up with the same set of values.

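8. Returning the Normalized Data:

``return(normalized_data)``

Finally, the function returns the `normalized_data` matrix, which contains the quantile-normalized version of your original data. It is a plain matrix without column names, which is why its columns appear as `V1`, `V2`, and `V3` in the output that follows.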

In essence, this code performs a rank-based normalization: each column is sorted on its own, the sorted columns are averaged row by row, and every original value is replaced by the row mean that matches its rank within its column. As a result, all columns end up with the same distribution of values, while the ordering of values within each column is preserved.
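The steps above can also be sketched more compactly. The helper below, `qn_compact()`, is a hypothetical variant and not part of the original function: it assumes numeric input with at least two rows and no missing values, breaks ties by order of appearance, and keeps the column names.

```r
# Hypothetical helper: compact quantile normalization that keeps dimnames.
# Assumes numeric data, at least two rows, and no NA values.
qn_compact <- function(x) {
  x <- as.matrix(x)
  # Shared reference distribution: row means of the column-wise sorted data
  reference <- rowMeans(apply(x, 2, sort))
  # Look up the reference value matching each value's rank in its column
  out <- apply(x, 2, function(col) reference[rank(col, ties.method = "first")])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(42)
df <- data.frame(
  sample1 = rnorm(100, mean = 5, sd = 2),
  sample2 = rnorm(100, mean = 10, sd = 1),
  sample3 = rnorm(100)
)

normalized <- qn_compact(df)
head(normalized, 3)
```

Keeping the column names is a small design choice that makes later summaries and plots easier to label; the explicit loop and the `apply()` version do the same rank-based lookup.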

## Step 3: Explore the Results

After quantile normalization, you’ll have a matrix of normalized values, one column per sample, ready for further analysis. Take a moment to compare the summaries before and after normalization and check that the distributions are aligned as expected.

```r
summary(df)
```

```
    sample1           sample2          sample3
Min.   :-0.9862   Min.   : 7.975   Min.   :-2.69993
1st Qu.: 3.7666   1st Qu.: 9.409   1st Qu.:-0.71167
Median : 5.1796   Median : 9.931   Median :-0.02474
Mean   : 5.0650   Mean   : 9.913   Mean   :-0.01037
3rd Qu.: 6.3231   3rd Qu.:10.462   3rd Qu.: 0.65254
Max.   : 9.5733   Max.   :12.702   Max.   : 2.45959
```

```r
# Explore the results
summary(normalized_data)
```

```
       V1              V2              V3
Min.   :1.430   Min.   :1.430   Min.   :1.430
1st Qu.:4.154   1st Qu.:4.154   1st Qu.:4.154
Median :5.029   Median :5.029   Median :5.029
Mean   :4.989   Mean   :4.989   Mean   :4.989
3rd Qu.:5.812   3rd Qu.:5.812   3rd Qu.:5.812
Max.   :8.245   Max.   :8.245   Max.   :8.245
```

## Step 4: Obtain Quantiles

Now that the data is normalized, we can extract the quantiles to compare the distributions across datasets. This will help you confirm that the normalization process was successful.

```r
as.data.frame(normalized_data) |>
  sapply(function(x) quantile(x, probs = seq(0, 1, 1/4)))
```

```
           V1       V2       V3
0%   1.429737 1.429737 1.429737
25%  4.154481 4.154481 4.154481
50%  5.028521 5.028521 5.028521
75%  5.812480 5.812480 5.812480
100% 8.244925 8.244925 8.244925
```

As we can see, the quantiles of the normalized data are identical across the three samples. This indicates that the distributions have been aligned through quantile normalization.
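Matching quantiles show that the columns share a distribution, but not that each value kept its place in its own column’s ordering. A minimal sketch of a check that covers both is shown here; `is_quantile_normalized()` is a hypothetical helper, and the small matrices are made up for illustration.

```r
# Hypothetical checker: TRUE when every column of `normalized` holds the same
# sorted values AND each column keeps the ordering of `original`.
is_quantile_normalized <- function(original, normalized) {
  original <- as.matrix(original)
  normalized <- as.matrix(normalized)
  sorted <- apply(normalized, 2, sort)
  same_distribution <- all(apply(sorted, 2, function(col) {
    isTRUE(all.equal(col, sorted[, 1]))
  }))
  same_ordering <- all(apply(original, 2, order) == apply(normalized, 2, order))
  same_distribution && same_ordering
}

original <- cbind(a = c(3, 1, 4, 2), b = c(10, 40, 20, 30))

# Correct result: reference values 5.5, 11, 16.5, 22 assigned by rank
good <- cbind(c(16.5, 5.5, 22, 11), c(5.5, 22, 11, 16.5))

# Same values in every column, but assigned in the wrong order
bad <- cbind(c(11, 22, 5.5, 16.5), c(5.5, 16.5, 22, 11))

is_quantile_normalized(original, good)  # TRUE
is_quantile_normalized(original, bad)   # FALSE
```

The `bad` matrix would pass a quantile comparison, since both of its columns contain the same four values, which is why the ordering check is worth adding. The same helper could be called on the tutorial’s objects, for example `is_quantile_normalized(df, normalized_data)`.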

Let’s visualize the normalized data as a further check:

```r
df_normalized <- as.data.frame(normalized_data)

hist(df_normalized$V1, col = 'red')
hist(df_normalized$V2, col = 'blue', add = TRUE)
hist(df_normalized$V3, col = 'green', add = TRUE)

legend('topright', c('Sample 1', 'Sample 2', 'Sample 3'), fill = c('red', 'blue', 'green'))
```

# Wrapping Up

Congratulations! You’ve worked through quantile normalization in R. By understanding the underlying concept, building the normalization from base R functions, and using `quantile()` to check the result, you can make your datasets comparable and ready for downstream analysis.

I encourage you to experiment with different datasets and explore the impact of quantile normalization on your analyses. Remember, practice makes perfect, so don’t hesitate to try it out on your own data. Happy coding!