Data Imputation and Scaling with healthyR.ai: A Guide for R Programmers

Learn how to efficiently handle missing data and scale variables in R using healthyR.ai. This guide explains imputation and scaling methods with clear syntax, practical examples, and best practices for streamlined data preprocessing in your machine learning projects.
Categories

code, rtip
Author

Steven P. Sanderson II, MPH

Published

September 29, 2025

Keywords

Programming, data imputation, data scaling, healthyR.ai, R data preprocessing, R imputation methods, R scaling functions, missing data handling in R, hai_data_impute example, hai_data_scale tutorial, tidymodels data preprocessing, how to impute missing values in R with healthyR.ai, step by step data scaling using hai_data_scale in R, best practices for data preprocessing in R with healthyR.ai, rolling window imputation example using healthyR.ai package, combining imputation and scaling in R with healthyR.ai functions

This guide covers data preprocessing with healthyR.ai, focusing on its imputation and scaling functions, with clear syntax, examples, and best practices.

Introduction

Data preprocessing is a necessary step in any machine learning workflow. The healthyR.ai package offers user-friendly functions for imputation (filling missing values) and scaling (normalizing data) that integrate seamlessly with the tidymodels ecosystem in R. This guide explains the syntax and implementation of these functions in straightforward terms.

Data Imputation with hai_data_impute()

Imputation replaces missing values in your dataset. The hai_data_impute() function supports multiple imputation methods through a consistent interface.

Basic Syntax

hai_data_impute(
  .recipe_object = NULL,
  ...,
  .type_of_imputation = "mean",
  .seed_value = 123
  # ...plus method-specific parameters (e.g., .neighbors, .roll_window)
)

Key arguments:

  • .recipe_object: Recipe object containing your data
  • ...: Variables to impute (using selector functions)
  • .type_of_imputation: Method for imputation
  • Method-specific parameters (e.g., .neighbors for KNN)

Supported Imputation Methods

Method     | Description                      | Best For                               | Key Parameters
"mean"     | Replace with column mean         | Normal distributions                   | .mean_trim
"median"   | Replace with column median       | Skewed data, outliers present          | None
"mode"     | Replace with most frequent value | Categorical variables                  | None
"knn"      | K-nearest neighbors imputation   | Complex relationships                  | .neighbors (default: 5)
"bagged"   | Bagged tree imputation           | Non-linear patterns                    | .number_of_trees (default: 25)
"linear"   | Linear model imputation          | Numeric data with linear relationships | None
"roll"     | Rolling window statistic         | Time series data                       | .roll_window, .roll_statistic

Example: Rolling Median Imputation

library(healthyR.ai)
library(recipes)
library(dplyr)

# Create recipe object (df_tbl is your data frame with a numeric value column)
rec_obj <- recipe(value ~ ., data = df_tbl)

# Apply rolling median imputation
imputed_data <- hai_data_impute(
  .recipe_object = rec_obj,
  value,
  .type_of_imputation = "roll",
  .roll_statistic = median
)$impute_rec_obj %>%
  get_juiced_data()
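
The other methods use the same call pattern; only .type_of_imputation and the method-specific arguments change. Here is a minimal sketch of a KNN call, reusing the rec_obj from above and the .neighbors argument from the table:

# K-nearest neighbors imputation (sketch; assumes rec_obj from above)
knn_imputed <- hai_data_impute(
  .recipe_object = rec_obj,
  value,
  .type_of_imputation = "knn",
  .neighbors = 5
)$impute_rec_obj %>%
  get_juiced_data()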

Data Scaling with hai_data_scale()

Scaling transforms variables to a common scale, which is important for many machine learning algorithms.

Basic Syntax

hai_data_scale(
  .recipe_object = NULL,
  ...,
  .type_of_scale = "center",
  .range_min = 0,
  .range_max = 1,
  .scale_factor = 1
)

Key arguments:

  • .recipe_object: Recipe object containing your data
  • ...: Variables to scale (using selector functions)
  • .type_of_scale: Method for scaling
  • .range_min, .range_max: Range bounds (for “range” method)
  • .scale_factor: Scale by 1 or 2 standard deviations (for interpretability)

Supported Scaling Methods

Method      | Description                  | Formula               | Result Range
"center"    | Subtract mean                | x - mean(x)           | Mean = 0, original variance
"scale"     | Divide by standard deviation | x / sd(x)             | Standard deviation = 1
"normalize" | Scale to unit norm           | x / ||x||             | Vector length = 1
"range"     | Min-max scaling              | (x - min)/(max - min) | [range_min, range_max]

Example: Standardization

library(healthyR.ai)
library(recipes)
library(dplyr)

# Create recipe object (df_tbl is your data frame with a numeric value column)
rec_obj <- recipe(value ~ ., data = df_tbl)

# Apply standardization (z-score)
scaled_data <- hai_data_scale(
  .recipe_object = rec_obj,
  value,
  .type_of_scale = "scale"
)$scale_rec_obj %>%
  get_juiced_data()
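
Min-max scaling follows the same pattern; here is a minimal sketch using the "range" method with the .range_min and .range_max arguments described above:

# Min-max scaling to [0, 1] (sketch; assumes rec_obj from above)
range_scaled <- hai_data_scale(
  .recipe_object = rec_obj,
  value,
  .type_of_scale = "range",
  .range_min = 0,
  .range_max = 1
)$scale_rec_obj %>%
  get_juiced_data()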

Combining Imputation and Scaling

A typical preprocessing workflow combines both steps:

# Create recipe (data_df is your data frame with a target outcome column)
rec_obj <- recipe(target ~ ., data = data_df)

# Step 1: Impute missing values
imputed <- hai_data_impute(
  .recipe_object = rec_obj,
  all_numeric(),
  .type_of_imputation = "median"
)$impute_rec_obj

# Step 2: Scale the imputed data
final_data <- hai_data_scale(
  .recipe_object = imputed,
  all_numeric(),
  .type_of_scale = "range"
)$scale_rec_obj %>%
  get_juiced_data()
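
Once the chain finishes, it is worth confirming the preprocessing behaved as expected. A quick check on the final_data tibble from above:

# Sanity checks on the processed data
summary(final_data)          # scaled columns should fall within [0, 1]
colSums(is.na(final_data))   # imputation should leave no missing values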

Examples

I think things work best when you can see an example in action.

Imputation

library(healthyR.ai)
library(recipes)
library(dplyr)
library(ggplot2)
library(purrr)

n <- 10L
l <- 5L
lo <- n * l

date_seq <- seq.Date(from = as.Date("2013-01-01"), length.out = lo, by = "month")
date_seq
 [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-01" "2013-05-01"
 [6] "2013-06-01" "2013-07-01" "2013-08-01" "2013-09-01" "2013-10-01"
[11] "2013-11-01" "2013-12-01" "2014-01-01" "2014-02-01" "2014-03-01"
[16] "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01" "2014-08-01"
[21] "2014-09-01" "2014-10-01" "2014-11-01" "2014-12-01" "2015-01-01"
[26] "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01" "2015-06-01"
[31] "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01" "2015-11-01"
[36] "2015-12-01" "2016-01-01" "2016-02-01" "2016-03-01" "2016-04-01"
[41] "2016-05-01" "2016-06-01" "2016-07-01" "2016-08-01" "2016-09-01"
[46] "2016-10-01" "2016-11-01" "2016-12-01" "2017-01-01" "2017-02-01"
val_seq <- replicate(n = l, c(runif(9), NA)) |> as.vector() |> as.double()
val_seq
 [1] 0.74815520 0.62345014 0.98719405 0.98357823 0.64343460 0.38288945
 [7] 0.30782868 0.63132596 0.09734484         NA 0.79572696 0.08743225
[13] 0.72841099 0.78703884 0.39553790 0.54639674 0.96807028 0.60125354
[19] 0.74665373         NA 0.92237646 0.04457192 0.68444841 0.05388607
[25] 0.24374963 0.73552094 0.84926348 0.55056715 0.77699405         NA
[31] 0.55460139 0.24564446 0.24396533 0.60797386 0.71226179 0.93048958
[37] 0.72179306 0.01549613 0.88487496         NA 0.41888816 0.08623630
[43] 0.06213051 0.58266383 0.72425739 0.17659346 0.80285097 0.78684451
[49] 0.68433082         NA
data_tbl <- tibble(
  date_col = date_seq,
  value = val_seq
)

rec_obj <- recipe(value ~ date_col, data = data_tbl)
rec_obj

df_tbl <- tibble(
  impute_type = c("bagged","knn","linear","mean","median","roll"),
  rec_obj = list(rec_obj),
  data = list(data_tbl)
)
df_tbl[1,][[3]][[1]]
# A tibble: 50 × 2
   date_col     value
   <date>       <dbl>
 1 2013-01-01  0.748 
 2 2013-02-01  0.623 
 3 2013-03-01  0.987 
 4 2013-04-01  0.984 
 5 2013-05-01  0.643 
 6 2013-06-01  0.383 
 7 2013-07-01  0.308 
 8 2013-08-01  0.631 
 9 2013-09-01  0.0973
10 2013-10-01 NA     
# ℹ 40 more rows
data_list <- df_tbl |>
  group_split(impute_type)

data_impute_list <- data_list |>
  imap(
    .f = function(obj, id){
      imp_type = obj |> pull(impute_type)
      rec_obj = obj |> pull(rec_obj) |> pluck(1)
      data = obj[["data"]][[1]]
      
      imp_obj <- hai_data_impute(
        .recipe_object = rec_obj,
        value,
        .type_of_imputation = imp_type,
        .roll_statistic = median
      )$impute_rec_obj

      imputed_data <- get_juiced_data(imp_obj)

      combined_tbl <- data |>
        left_join(imputed_data, by = "date_col") |>
        setNames(c("date_col", "original_value", "imputed_value")) |>
        mutate(rec_no = row_number()) |>
        mutate(color_col = original_value,
              size_col = original_value) |>
        mutate(impute_type = imp_type)
      
      return(combined_tbl)
    }
  )

combined_tbl <- data_impute_list |>
  list_rbind()

imped_na_vals_tbl <- combined_tbl |>
  filter(is.na(original_value)) |>
  summarize(
        avg_imputed_val = mean(imputed_value),
        .by = impute_type
  )

combined_tbl |>
  summarize(
        avg_imputed_val_col = mean(imputed_value),
        avg_original_val_col = mean(original_value, na.rm = TRUE),
        .by = impute_type
  ) |>
mutate(imputation_diff = avg_imputed_val_col - avg_original_val_col) |>
left_join(imped_na_vals_tbl, by = "impute_type")
# A tibble: 6 × 5
  impute_type avg_imputed_val_col avg_original_val_col imputation_diff
  <chr>                     <dbl>                <dbl>           <dbl>
1 bagged                    0.556                0.559        -0.00287
2 knn                       0.557                0.559        -0.00199
3 linear                    0.558                0.559        -0.00128
4 mean                      0.559                0.559         0      
5 median                    0.566                0.559         0.00721
6 roll                      0.555                0.559        -0.00434
# ℹ 1 more variable: avg_imputed_val <dbl>
ggplot(data = combined_tbl,
  aes(
    x = date_col,
    y = imputed_value,
    color = color_col
    )
  ) + 
  facet_wrap(~ impute_type) +
  geom_point(data = combined_tbl |> filter(is.na(original_value)), aes(shape = 'NA', size = 3)) +
  scale_shape_manual(values = c('NA' = 3)) +
  geom_line(aes(x = date_col, y = original_value), color = "black") +
  geom_line(aes(x = date_col, y = imputed_value), color = "red", linetype = "dashed", alpha = .328) +
  geom_vline(
    data = combined_tbl[combined_tbl$original_value |> is.na(), ], 
    aes(xintercept = date_col), color = "black", linetype = "dashed"
  ) +
  labs(
    x = "Date",
    y = "Value",
    title = "Original vs. Imputed Data using healthyR.ai",
    subtitle = "Function: hai_data_impute()",
    caption = "Dashed red line is the imputed data, solid black line is the original data"
  ) +
  theme_classic() +
  theme(legend.position = "none")

combined_tbl |>
  filter(is.na(original_value)) |>
  ggplot(aes(x = impute_type, y = imputed_value, color = impute_type, group = impute_type)) +
  geom_boxplot() +
  labs(
    x = "Imputation Type",
    y = "Imputed Value",
    title = "Distribution of Imputed Values by Method using healthyR.ai",
    subtitle = "Function: hai_data_impute()"
  ) +
  theme_classic() +
  theme(legend.position = "none")
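
If you want to see exactly what each method filled in, pull the rows that were originally missing. A short follow-up on the combined_tbl built above:

# Inspect the imputed values at the originally missing dates
combined_tbl |>
  filter(is.na(original_value)) |>
  select(impute_type, date_col, imputed_value) |>
  arrange(date_col, impute_type)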

Choosing the Right Method

For imputation:

  • Continuous normal data: Use “mean”
  • Skewed data or outliers: Use “median”
  • Categorical data: Use “mode”
  • Time series: Use “roll” with appropriate window
  • Complex relationships: Try “knn” or “bagged”

For scaling:

  • Linear regression: Use “center” or “scale”
  • Distance-based algorithms (KNN, SVM): Use “scale”
  • Neural networks: Use “range” [0,1] or “normalize”

Best Practices

  • Always load required libraries: healthyR.ai, recipes, dplyr
  • Create recipe object first: recipe(target ~ ., data = df)
  • Set .seed_value for reproducible results with stochastic methods (see the sketch after this list)
  • Extract processed data with get_juiced_data() function
  • Verify results with summary() after processing
  • Choose imputation method based on data characteristics and missingness pattern
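
As an illustration of the reproducibility point above, here is a minimal sketch of a bagged-tree imputation with an explicit seed, reusing the recipe-object pattern shown earlier:

# Reproducible bagged-tree imputation (sketch; assumes a recipe object rec_obj)
bagged_imputed <- hai_data_impute(
  .recipe_object = rec_obj,
  value,
  .type_of_imputation = "bagged",
  .number_of_trees = 25,   # from the methods table
  .seed_value = 123        # fixes the random seed for reproducibility
)$impute_rec_obj %>%
  get_juiced_data()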

Key Takeaways

  • healthyR.ai provides user-friendly wrappers around recipes functions for data preprocessing
  • Both imputation and scaling require a recipe object
  • Functions return a list containing the processed recipe object
  • Choose methods based on your data type and the requirements of your modeling approach
  • Chain operations (imputation → scaling → modeling) for a complete workflow

These healthyR.ai functions aim to simplify data preprocessing through a consistent syntax and tight integration with the tidymodels ecosystem, making the package an excellent choice for streamlining your machine learning pipeline.


Happy Coding! 🚀


You can connect with me at any one of the below:

Telegram Channel here: https://t.me/steveondata

LinkedIn Network here: https://www.linkedin.com/in/spsanderson/

Mastodon Social here: https://mstdn.social/@stevensanderson

RStats Network here: https://rstats.me/@spsanderson

GitHub Network here: https://github.com/spsanderson

Bluesky Network here: https://bsky.app/profile/spsanderson.com

My Book: Extending Excel with Python and R here: https://packt.link/oTyZJ

You.com Referral Link: https://you.com/join/EHSLDTL6