Data Imputation and Scaling with healthyR.ai: A Guide for R Programmers
Learn how to efficiently handle missing data and scale variables in R using healthyR.ai. This guide explains imputation and scaling methods with clear syntax, practical examples, and best practices for streamlined data preprocessing in your machine learning projects.
code
rtip
Author
Steven P. Sanderson II, MPH
Published
September 29, 2025
Keywords
Programming, data imputation, data scaling, healthyR.ai, R data preprocessing, R imputation methods, R scaling functions, missing data handling in R, hai_data_impute example, hai_data_scale tutorial, tidymodels data preprocessing, how to impute missing values in R with healthyR.ai, step by step data scaling using hai_data_scale in R, best practices for data preprocessing in R with healthyR.ai, rolling window imputation example using healthyR.ai package, combining imputation and scaling in R with healthyR.ai functions
This guide covers data preprocessing with healthyR.ai, focusing on its imputation and scaling functions, with clear syntax, practical examples, and best practices.
Introduction
Data preprocessing is a necessary step in any machine learning workflow. The healthyR.ai package offers user-friendly functions for imputation (filling missing values) and scaling (normalizing data) that integrate seamlessly with the tidymodels ecosystem in R. This guide explains the syntax and implementation of these functions in straightforward terms.
Data Imputation with hai_data_impute()
Imputation replaces missing values in your dataset. The hai_data_impute() function supports multiple imputation methods through a consistent interface.
Basic Syntax
```r
hai_data_impute(
  .recipe_object = NULL,
  ...,
  .type_of_imputation = "mean",
  .seed_value = 123
  # Additional parameters based on method
)
```
Key arguments:

- `.recipe_object`: Recipe object containing your data
- `...`: Variables to impute, passed as selector functions (see the sketch after this list)
- `.type_of_imputation`: Method for imputation
- Method-specific parameters (e.g., `.neighbors` for KNN)
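Selector helpers pass through the dots to the underlying recipes step. A minimal sketch, assuming the dots forward tidyselect expressions; `df_tbl` and the `measure_` column prefix are placeholders, not from the package:

```r
library(healthyR.ai)
library(recipes)
library(dplyr)

# rec_obj is a recipe built from a placeholder tibble df_tbl
rec_obj <- recipe(value ~ ., data = df_tbl)

# Impute every column whose name starts with "measure_" (hypothetical names)
hai_data_impute(
  .recipe_object = rec_obj,
  starts_with("measure_"),
  .type_of_imputation = "median"
)
```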
Supported Imputation Methods
| Method   | Description                      | Best For                      | Key Parameters                    |
|----------|----------------------------------|-------------------------------|-----------------------------------|
| "mean"   | Replace with column mean         | Normal distributions          | `.mean_trim`                      |
| "median" | Replace with column median       | Skewed data, outliers present | None                              |
| "mode"   | Replace with most frequent value | Categorical variables         | None                              |
| "knn"    | K-nearest neighbors imputation   | Complex relationships         | `.neighbors` (default: 5)         |
| "bagged" | Bagged tree imputation           | Non-linear patterns           | `.number_of_trees` (default: 25)  |
| "roll"   | Rolling window statistic         | Time series data              | `.roll_window`, `.roll_statistic` |
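The method-specific parameters from the table ride along in the same call. A minimal sketch of KNN imputation with a non-default neighbor count (`df_tbl` is a placeholder tibble with a numeric `value` column containing NAs; `7` is an arbitrary choice for illustration):

```r
library(healthyR.ai)
library(recipes)
library(dplyr)

rec_obj <- recipe(value ~ ., data = df_tbl)

# KNN imputation with a non-default neighbor count
knn_imputed <- hai_data_impute(
  .recipe_object = rec_obj,
  value,
  .type_of_imputation = "knn",
  .neighbors = 7
)$impute_rec_obj %>%
  get_juiced_data()
```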
Example: Rolling Median Imputation
```r
library(healthyR.ai)
library(recipes)
library(dplyr)

# Create recipe object
rec_obj <- recipe(value ~ ., data = df_tbl)

# Apply rolling median imputation
imputed_data <- hai_data_impute(
  .recipe_object = rec_obj,
  value,
  .type_of_imputation = "roll",
  .roll_statistic = median
)$impute_rec_obj %>%
  get_juiced_data()
```

Note that hai_data_impute() returns a list: the `impute_rec_obj` element holds the updated recipe, and get_juiced_data() preps and juices it into a ready-to-use tibble.
Comparing Imputation Methods
After running several of the methods above over the same series, you can compare the average imputed value from each method against the original column mean:
```r
# A tibble: 6 × 5
  impute_type avg_imputed_val_col avg_original_val_col imputation_diff
  <chr>                     <dbl>                <dbl>           <dbl>
1 bagged                    0.556                0.559        -0.00287
2 knn                       0.557                0.559        -0.00199
3 linear                    0.558                0.559        -0.00128
4 mean                      0.559                0.559         0
5 median                    0.566                0.559         0.00721
6 roll                      0.555                0.559        -0.00434
# ℹ 1 more variable: avg_imputed_val <dbl>
```
Mean imputation reproduces the original average exactly, while the other methods land within roughly 0.007 of it. To see where each method places the imputed points, plot the imputed series against the original:

```r
ggplot(
  data = combined_tbl,
  aes(
    x = date_col,
    y = imputed_value,
    color = color_col
  )
) +
  facet_wrap(~ impute_type) +
  geom_point(
    data = combined_tbl |> filter(is.na(original_value)),
    aes(shape = "NA"),
    size = 3
  ) +
  scale_shape_manual(values = c("NA" = 3)) +
  geom_line(aes(x = date_col, y = original_value), color = "black") +
  geom_line(
    aes(x = date_col, y = imputed_value),
    color = "red", linetype = "dashed", alpha = .328
  ) +
  geom_vline(
    data = combined_tbl[combined_tbl$original_value |> is.na(), ],
    aes(xintercept = date_col),
    color = "black", linetype = "dashed"
  ) +
  labs(
    x = "Date",
    y = "Value",
    title = "Original vs. Imputed Data using healthyR.ai",
    subtitle = "Function: hai_data_impute()",
    caption = "Red dashed line is the imputed data, black line is the original data"
  ) +
  theme_classic() +
  theme(legend.position = "none")
```
A boxplot of just the imputed points makes the spread of each method easier to compare:

```r
combined_tbl |>
  filter(is.na(original_value)) |>
  ggplot(aes(
    x = impute_type,
    y = imputed_value,
    color = impute_type,
    group = impute_type
  )) +
  geom_boxplot() +
  labs(
    x = "Imputation Method",
    y = "Value",
    title = "Original vs. Imputed Data using healthyR.ai",
    subtitle = "Function: hai_data_impute()"
  ) +
  theme_classic() +
  theme(legend.position = "none")
```

Data Scaling with hai_data_scale()
Scaling transforms variables to a common scale, which is important for algorithms that are sensitive to variable magnitude, such as KNN and gradient-based methods.
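The call pattern mirrors hai_data_impute(). A minimal sketch, assuming `df_tbl` is a placeholder tibble, that `"center"` is among the accepted `.type_of_scale` values, and that the returned list names its recipe `scale_rec_obj` (by analogy with `impute_rec_obj` above); check the package documentation for the full set of options:

```r
library(healthyR.ai)
library(recipes)
library(dplyr)

rec_obj <- recipe(value ~ ., data = df_tbl)

# Center the selected column; other .type_of_scale values are assumed
# to map to the matching recipes steps
scaled_data <- hai_data_scale(
  .recipe_object = rec_obj,
  value,
  .type_of_scale = "center"
)$scale_rec_obj %>%
  get_juiced_data()
```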
Best Practices
- Create the recipe object first: `recipe(target ~ ., data = df)`
- Set `.seed_value` for reproducible results with stochastic methods
- Extract the processed data with the `get_juiced_data()` function
- Verify results with `summary()` after processing
- Choose the imputation method based on your data characteristics and missingness pattern
Key Takeaways
- healthyR.ai provides user-friendly wrappers around recipes functions for data preprocessing
- Both imputation and scaling require a recipe object
- Functions return a list containing the processed recipe object
- Choose methods based on your data type and the requirements of your modeling approach
- Chain operations (imputation → scaling → modeling) for a complete workflow, as sketched below
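To make the chaining concrete, here is a minimal sketch of an imputation-then-scaling pipeline, under the same assumptions as above (placeholder `df_tbl`, and `scale_rec_obj` as the name of the scaling recipe's list element):

```r
library(healthyR.ai)
library(recipes)
library(dplyr)

rec_obj <- recipe(value ~ ., data = df_tbl)

# Step 1: impute missing values (median is robust to outliers)
imputed_rec <- hai_data_impute(
  .recipe_object = rec_obj,
  value,
  .type_of_imputation = "median"
)$impute_rec_obj

# Step 2: scale, starting from the imputed recipe
processed_tbl <- hai_data_scale(
  .recipe_object = imputed_rec,
  value,
  .type_of_scale = "center"
)$scale_rec_obj %>%
  get_juiced_data()

# Step 3: verify the processed data before modeling
summary(processed_tbl)
```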
These healthyR.ai functions aim to simplify data preprocessing through a consistent syntax and tight integration with the tidymodels ecosystem, making them an excellent choice for streamlining your machine learning pipeline.