# Quantile Normalization in R with the {TidyDensity} Package

Categories: code, rtip, tidydensity

Author: Steven P. Sanderson II, MPH

Published: April 30, 2024

# Introduction

In data analysis, especially when dealing with multiple samples or distributions, ensuring comparability and removing biases is crucial. One powerful technique for achieving this is quantile normalization. This method aligns the distributions of values across different samples, making them more similar in terms of their statistical properties.

# What is Quantile Normalization?

Quantile normalization is a statistical method that adjusts the distributions of values in different datasets so that they share the same quantiles. After normalization, every sample contains the same set of values, arranged according to each sample's original ordering. This technique is particularly valuable when working with high-dimensional data, such as gene expression data or other omics datasets, where comparability across samples is essential.

# Introducing `quantile_normalize()` in TidyDensity

The `quantile_normalize()` function is a new addition to the TidyDensity package, designed to simplify the process of quantile normalization within R. Let’s delve into how this function works and how you can integrate it into your data analysis pipeline.

# Function Usage

The `quantile_normalize()` function takes a numeric matrix as input, where each column represents a sample. Here’s a breakdown of its usage:

```r
quantile_normalize(.data, .return_tibble = FALSE)
```

- `.data`: A numeric matrix where each column corresponds to a sample that requires quantile normalization.
- `.return_tibble`: A logical value (default: `FALSE`) indicating whether the output should be returned as a tibble.
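If your samples live in a data frame rather than a matrix, a conversion step comes first. The sketch below is illustrative only: the data frame `expr_df` and its column names are made up for this example, and it assumes TidyDensity is installed. It converts the data frame with `as.matrix()`, confirms the result is numeric, and then calls `quantile_normalize()` once with each setting of `.return_tibble`.

```r
library(TidyDensity)

# Hypothetical data: 13 measurements for each of four samples
set.seed(123)
expr_df <- data.frame(
  sample_a = rnorm(13),
  sample_b = rnorm(13),
  sample_c = rnorm(13),
  sample_d = rnorm(13)
)

# The documented input is a numeric matrix, so convert and check first
expr_mat <- as.matrix(expr_df)
stopifnot(is.matrix(expr_mat), is.numeric(expr_mat))

# Default: list components are returned as base R objects
res_matrix <- quantile_normalize(expr_mat)

# .return_tibble = TRUE: list components are returned as tibbles
res_tibble <- quantile_normalize(expr_mat, .return_tibble = TRUE)
```

Both calls return a list; only the class of the components differs, so downstream code can pick whichever form fits the rest of the pipeline.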

# Understanding the Output

When you apply `quantile_normalize()` to your data, you receive a list object containing the following components:

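1. `normalized_data`: A numeric matrix, with the same dimensions as the input, where each column has been quantile-normalized.
2. `row_means`: The means of each row across the sorted columns. These are the values assigned to the data during normalization.
3. `duplicated_ranks`: The rank indices for the rows flagged as having duplicated ranks.
4. `duplicated_rank_row_indices`: The row indices of those flagged rows.
5. `duplicated_rank_data`: The rows of the input data at those indices.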

# How Quantile Normalization Works

The `quantile_normalize()` function performs quantile normalization through the following steps:

1. Sorting: Each column of the input matrix is sorted.
2. Row Mean Calculation: The mean of each row across the sorted columns is computed.
3. Normalization: Each column’s sorted values are replaced with the corresponding row means.
4. Unsorting: The columns are restored to their original order, ensuring that the quantile-normalized matrix maintains the same structure as the input.
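The four steps above can be sketched in a few lines of base R. The helper `qn_sketch()` below is a hypothetical illustration, not the TidyDensity source. It assumes a numeric matrix with at least two rows and no missing values, and it breaks ties by order of appearance (`ties.method = "first"`), which may differ from how `quantile_normalize()` treats ties.

```r
# Hypothetical helper (not part of TidyDensity): a base R sketch of the
# four steps. Assumes a numeric matrix, >= 2 rows, no missing values.
qn_sketch <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) >= 2, !anyNA(x))

  # Step 1: sort each column independently
  sorted <- apply(x, 2, sort)

  # Step 2: mean of each row across the sorted columns
  row_means <- rowMeans(sorted)

  # Steps 3 and 4: give every value the row mean that matches its rank
  # within its own column, which places the means in the original order.
  # ties.method = "first" is an assumption so that ranks are integers.
  ranks <- apply(x, 2, rank, ties.method = "first")
  normalized <- apply(ranks, 2, function(r) row_means[r])

  dimnames(normalized) <- dimnames(x)
  normalized
}

# A small matrix that is easy to check by hand (filled column by column)
toy <- matrix(c(5, 2, 3,
                4, 1, 6,
                3, 4, 8), nrow = 3)

toy_norm <- qn_sketch(toy)
toy_norm

# Every column now holds the same values once sorted
apply(toy_norm, 2, sort)
```

For this toy matrix the sorted columns are (2, 3, 5), (1, 4, 6), and (3, 4, 8), so the row means are 2, 11/3, and 19/3. Each original value is replaced by the mean that matches its rank within its column, and sorting the columns of the result gives the same three values everywhere.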

# Examples

Let’s demonstrate the usage of `quantile_normalize()` with a simple example:

```r
# Load TidyDensity
library(TidyDensity)

# Create a sample matrix
# Note: 50 values do not fill a 13 x 4 matrix evenly, so R recycles the
# first two values into the end of the last column and issues a warning
set.seed(123)
data <- matrix(rnorm(50), ncol = 4)

# View the first five rows
head(data, 5)
```
```
            [,1]       [,2]       [,3]       [,4]
[1,] -0.56047565  0.1106827  0.8377870 -0.3804710
[2,] -0.23017749 -0.5558411  0.1533731 -0.6947070
[3,]  1.55870831  1.7869131 -1.1381369 -0.2079173
[4,]  0.07050839  0.4978505  1.2538149 -1.2653964
[5,]  0.12928774 -1.9666172  0.4264642  2.1689560
```
```r
# Apply quantile normalization
result <- quantile_normalize(data)

# Access the quantile-normalized matrix
normalized_matrix <- result[["normalized_data"]]

# View the first five rows of the normalized matrix
head(normalized_matrix, 5)
```
```
            [,1]       [,2]        [,3]       [,4]
[1,] -0.65451945 -0.3180877  0.84500772 -0.6545195
[2,] -0.06327669  0.8450077  1.09078797 -0.9506544
[3,] -1.40880292 -0.5235134  0.33150422  0.0863713
[4,]  0.84500772  1.0907880  0.08637130  0.1991151
[5,] -0.31808774 -0.6545195 -0.06327669  0.3315042
```

Let’s now look at the rest of the output components:

```r
head(result[["row_means"]], 5)
```
```
[1] -1.4088029 -0.9506544 -0.6545195 -0.5235134 -0.3180877
```
```r
head(result[["duplicated_ranks"]], 5)
```
```
     [,1] [,2] [,3] [,4]
[1,]    9   13   13    7
[2,]   10   10   12   12
[3,]    2   11    2    9
[4,]   13    9    9    3
[5,]    7    1    1   11
```
```r
# The element is named `duplicated_rank_row_indices`; a misspelled name
# (such as "indicies") matches nothing and returns NULL.
# The first five indices are 2, 4, 5, 9, 10, as the tibble output below shows.
head(result[["duplicated_rank_row_indices"]], 5)
```
```r
head(result[["duplicated_rank_data"]], 5)
```
```
            [,1]       [,2]      [,3]       [,4]
[1,] -0.23017749 -0.5558411 0.1533731 -0.6947070
[2,]  0.07050839  0.4978505 1.2538149 -1.2653964
[3,]  0.12928774 -1.9666172 0.4264642  2.1689560
[4,] -0.68685285 -0.2179749 0.8215811 -0.4666554
[5,] -0.44566197 -1.0260044 0.6886403  0.7799651
```

Now, let's compare the quantiles of each column before and after quantile normalization:

```r
as.data.frame(data) |>
  sapply(function(x) quantile(x, probs = seq(0, 1, 1/4)))
```
```
             V1         V2          V3          V4
0%   -1.2650612 -1.9666172 -1.13813694 -1.26539635
25%  -0.4456620 -1.0260044 -0.06191171 -0.56047565
50%   0.1292877 -0.5558411  0.55391765 -0.38047100
75%   0.4609162  0.1106827  0.83778704 -0.08336907
100%  1.7150650  1.7869131  1.25381492  2.16895597
```
```r
as.data.frame(normalized_matrix) |>
  sapply(function(x) quantile(x, probs = seq(0, 1, 1/4)))
```
```
              V1          V2          V3          V4
0%   -1.40880292 -1.40880292 -1.40880292 -1.40880292
25%  -0.52351344 -0.52351344 -0.52351344 -0.52351344
50%  -0.06327669 -0.06327669 -0.06327669 -0.06327669
75%   0.33150422  0.33150422  0.33150422  0.33150422
100%  1.73118725  1.73118725  1.73118725  1.73118725
```

Now let’s use the `.return_tibble` argument to return the output as a tibble:

```r
quantile_normalize(data, .return_tibble = TRUE)
```
```
$normalized_data
# A tibble: 13 × 4
        V1      V2      V3      V4
     <dbl>   <dbl>   <dbl>   <dbl>
 1 -0.655  -0.318   0.845  -0.655
 2 -0.0633  0.845   1.09   -0.951
 3 -1.41   -0.524   0.332   0.0864
 4  0.845   1.09    0.0864  0.199
 5 -0.318  -0.655  -0.0633  0.332
 6  1.73   -0.0633 -0.133  -0.133
 7 -0.524  -0.133  -0.524  -0.524
 8 -0.133   1.73    1.73    1.73
 9  0.332   0.0864  0.199   1.09
10  1.09   -0.951  -0.655  -0.318
11 -0.951  -1.41   -0.318  -1.41
12  0.199   0.199  -1.41    0.845
13  0.0864  0.332  -0.951  -0.0633

$row_means
# A tibble: 13 × 1
     value
     <dbl>
 1 -1.41
 2 -0.951
 3 -0.655
 4 -0.524
 5 -0.318
 6 -0.133
 7 -0.0633
 8  0.0864
 9  0.199
10  0.332
11  0.845
12  1.09
13  1.73

$duplicated_ranks
# A tibble: 6 × 4
     V1    V2    V3    V4
  <int> <int> <int> <int>
1     9    13    13     7
2    10    10    12    12
3     2    11     2     9
4    13     9     9     3
5     7     1     1    11
6     3     6     7     6

$duplicated_rank_row_indices
# A tibble: 6 × 1
  row_index
      <int>
1         2
2         4
3         5
4         9
5        10
6        12

$duplicated_rank_data
# A tibble: 6 × 4
       V1     V2      V3     V4
    <dbl>  <dbl>   <dbl>  <dbl>
1 -0.230  -0.556  0.153  -0.695
2  0.0705  0.498  1.25   -1.27
3  0.129  -1.97   0.426   2.17
4 -0.687  -0.218  0.822  -0.467
5 -0.446  -1.03   0.689   0.780
6  0.360  -0.625 -0.0619 -0.560
```

# Conclusion

In summary, the `quantile_normalize()` function from the TidyDensity package offers a convenient way to perform quantile normalization on numeric matrices in R. It gives every sample (column) the same set of quantiles, which makes multiple samples or distributions directly comparable. Incorporate `quantile_normalize()` into your data preprocessing workflow when you need that comparability before further analysis.

To explore more of what TidyDensity can do, check out the package documentation, and try `quantile_normalize()` with both settings of the `.return_tibble` argument to see which output format suits your workflow.