Data Preprocessing Scale/Normalize with {healthyR.ai}

code
rtip
healthyrai
recipes
Author

Steven P. Sanderson II, MPH

Published

November 22, 2022

Introduction

A large portion of data modeling occurs not only in the data cleaning phase but also in the data preprocessing phase. This can include things like scaling or normalizing data before proceeding to the modeling phase. In this post I will go over one such function from my R package {healthyR.ai}: hai_data_scale().

This is a {recipes} style step function and is tidymodels compliant.

Function

Let’s take a look at the function call.

hai_data_scale(
  .recipe_object = NULL,
  ...,
  .type_of_scale = "center",
  .range_min = 0,
  .range_max = 1,
  .scale_factor = 1
)

Now let’s go over the arguments that get supplied to the parameters of this function.

  • .recipe_object - The recipe object containing the data that you want to process
  • ... - One or more selector functions to choose the variables to be scaled. See selections() for more details
  • .type_of_scale - This is a quoted argument and can be one of the following:
    1. “center”
    2. “normalize”
    3. “range”
    4. “scale”
  • .range_min - A single numeric value for the smallest value in the range. This defaults to 0.
  • .range_max - A single numeric value for the largest value in the range. This defaults to 1.
  • .scale_factor - A numeric value of either 1 or 2 that scales the numeric inputs by one or two standard deviations. By dividing by two standard deviations, the coefficients attached to continuous predictors can be interpreted the same way as with binary inputs. Defaults to 1.
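To make the four options concrete, here is a minimal base R sketch of the arithmetic usually meant by each name. It assumes the options behave like the {recipes} steps of the same name (step_center(), step_scale(), step_normalize() and step_range()); the vector x and the variable names are made up for illustration and are not part of {healthyR.ai}.

```r
# Illustrative input (hypothetical data, not from the package)
x <- c(2, 4, 6, 8, 10)

# "center": subtract the mean so the result has mean 0
centered <- x - mean(x)

# "scale": divide by the standard deviation times the scale factor (1 or 2)
scale_factor <- 1
scaled <- x / (sd(x) * scale_factor)

# "normalize": center, then divide by the standard deviation
normalized <- (x - mean(x)) / sd(x)

# "range": min-max rescale into [range_min, range_max]
range_min <- 0
range_max <- 1
ranged <- (x - min(x)) / (max(x) - min(x)) * (range_max - range_min) + range_min

centered # -4 -2  0  2  4
ranged   # 0.00 0.25 0.50 0.75 1.00
```

Setting scale_factor to 2 in this sketch halves the scaled values, which is the "two standard deviations" case described for .scale_factor.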

Example

Now let’s see it in action!

library(healthyR.ai)
library(dplyr)
library(recipes)

date_seq <- seq.Date(
  from = as.Date("2013-01-01"), 
  length.out = 100, 
  by = "month"
)

val_seq <- rep(rnorm(10, mean = 6, sd = 2), times = 10)

df_tbl <- tibble(
  date_col = date_seq,
  value    = val_seq
)

df_tbl
# A tibble: 100 × 2
   date_col   value
   <date>     <dbl>
 1 2013-01-01  6.66
 2 2013-02-01  6.66
 3 2013-03-01  5.09
 4 2013-04-01  6.94
 5 2013-05-01  5.96
 6 2013-06-01  6.18
 7 2013-07-01  3.62
 8 2013-08-01  7.31
 9 2013-09-01  4.58
10 2013-10-01  7.29
# … with 90 more rows

rec_obj <- recipe(value ~ ., df_tbl)

new_rec_obj <- hai_data_scale(
  .recipe_object = rec_obj,
  value,
  .type_of_scale = "center"
)$scale_rec_obj

new_rec_obj %>% 
  get_juiced_data()
# A tibble: 100 × 2
   date_col     value
   <date>       <dbl>
 1 2013-01-01  0.633 
 2 2013-02-01  0.630 
 3 2013-03-01 -0.935 
 4 2013-04-01  0.909 
 5 2013-05-01 -0.0676
 6 2013-06-01  0.149 
 7 2013-07-01 -2.41  
 8 2013-08-01  1.28  
 9 2013-09-01 -1.45  
10 2013-10-01  1.26  
# … with 90 more rows

Voila!
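As a quick sanity check, the centered column can be compared against a manual calculation. The sketch below uses plain {recipes} calls rather than hai_data_scale(), on the assumption that .type_of_scale = "center" behaves like step_center(). The set.seed() call and the object names (df, rec, centered, manual) are additions for illustration, so the numbers will not match the output above.

```r
library(recipes)

set.seed(123) # assumption: seed added only to make this sketch reproducible

# Same shape of data as in the example: 100 monthly dates, 10 repeated draws
df <- data.frame(
  date_col = seq.Date(as.Date("2013-01-01"), length.out = 100, by = "month"),
  value    = rep(rnorm(10, mean = 6, sd = 2), times = 10)
)

# Build the recipe, add a centering step, then prep and bake the training data
rec <- recipe(value ~ ., data = df)
rec <- step_center(rec, value)
centered <- bake(prep(rec), new_data = NULL)

# Manual centering for comparison
manual <- df$value - mean(df$value)

all.equal(centered$value, manual)
```

If the two agree, the centered column has a mean of zero (up to floating point error) while its spread is unchanged.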