An Update on {tidyAML}

Tags: code, rtip, tidyaml, automl
Author: Steven P. Sanderson II, MPH
Published: January 12, 2023

Introduction

I have been doing a lot of work on a new package called {tidyAML}. {tidyAML} is an R package that makes it easy to use the {tidymodels} ecosystem to perform automated machine learning (AutoML). It provides a simple and intuitive interface that allows users to quickly generate machine learning models without worrying about the underlying details. It also includes a safety mechanism that ensures the package fails gracefully if any required extension packages are not installed on the user's machine. With {tidyAML}, users can easily build high-quality machine learning models in just a few lines of code. Whether you are a beginner or an experienced machine learning practitioner, {tidyAML} has something to offer.

One idea is that we should be able to generate regression models on the fly without having to go through the process of building the specification by hand, especially for a non-tuning model, meaning one where we are not planning on tuning hyper-parameters like penalty and cost.
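
For context, here is roughly what building one such non-tuning specification looks like by hand with {parsnip}. This is the sort of work {tidyAML} automates (a minimal sketch):

library(parsnip)

# A single non-tuning spec: no tune() placeholders for penalty or cost,
# so it is ready to fit as-is.
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")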

The idea is not to rewrite the excellent work the {tidymodels} team has done (because that's not possible), but rather to build an enhanced, easy-to-use set of functions that do what they say and can generate many models and predictions at once.

This is similar to the great {h2o} package, but {tidyAML} does not require Java to be set up properly the way {h2o} does, because {tidyAML} is built on {tidymodels}.

This package is not yet released, so you can only install it from GitHub with the following:

# install.packages("devtools")
devtools::install_github("spsanderson/tidyAML")

Example

library(tidyAML)

fast_regression_parsnip_spec_tbl(.parsnip_fns = "linear_reg")
# A tibble: 11 × 5
   .model_id .parsnip_engine .parsnip_mode .parsnip_fns model_spec
       <int> <chr>           <chr>         <chr>        <list>    
 1         1 lm              regression    linear_reg   <spec[+]> 
 2         2 brulee          regression    linear_reg   <spec[+]> 
 3         3 gee             regression    linear_reg   <spec[+]> 
 4         4 glm             regression    linear_reg   <spec[+]> 
 5         5 glmer           regression    linear_reg   <spec[+]> 
 6         6 glmnet          regression    linear_reg   <spec[+]> 
 7         7 gls             regression    linear_reg   <spec[+]> 
 8         8 lme             regression    linear_reg   <spec[+]> 
 9         9 lmer            regression    linear_reg   <spec[+]> 
10        10 stan            regression    linear_reg   <spec[+]> 
11        11 stan_glmer      regression    linear_reg   <spec[+]> 
fast_regression_parsnip_spec_tbl(.parsnip_eng = c("lm","glm"))
# A tibble: 3 × 5
  .model_id .parsnip_engine .parsnip_mode .parsnip_fns model_spec
      <int> <chr>           <chr>         <chr>        <list>    
1         1 lm              regression    linear_reg   <spec[+]> 
2         2 glm             regression    linear_reg   <spec[+]> 
3         3 glm             regression    poisson_reg  <spec[+]> 
fast_regression_parsnip_spec_tbl(.parsnip_eng = c("lm","glm","gee"), 
                                 .parsnip_fns = "linear_reg")
# A tibble: 3 × 5
  .model_id .parsnip_engine .parsnip_mode .parsnip_fns model_spec
      <int> <chr>           <chr>         <chr>        <list>    
1         1 lm              regression    linear_reg   <spec[+]> 
2         2 gee             regression    linear_reg   <spec[+]> 
3         3 glm             regression    linear_reg   <spec[+]> 

As shown, we can easily select the models we want, either by choosing a supported parsnip function like linear_reg() or by choosing the desired engine; you can also use the two in conjunction with each other!
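
And since the result is an ordinary tibble, you can refine it further with standard {dplyr} verbs; for example, this sketch filters the table from the first call above:

library(dplyr)

fast_regression_parsnip_spec_tbl(.parsnip_fns = "linear_reg") %>%
  filter(.parsnip_engine %in% c("lm", "glmnet"))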

Now, what if you want to create a non-tuning model spec without using the fast_regression_parsnip_spec_tbl() function? Well, you can. The function is called create_model_spec().

create_model_spec(
  .parsnip_eng = list("lm", "glm", "glmnet", "cubist"),
  .parsnip_fns = list(
    rep("linear_reg", 3),
    "cubist_rules"
  )
)
# A tibble: 4 × 4
  .parsnip_engine .parsnip_mode .parsnip_fns .model_spec
  <chr>           <chr>         <chr>        <list>     
1 lm              regression    linear_reg   <spec[+]>  
2 glm             regression    linear_reg   <spec[+]>  
3 glmnet          regression    linear_reg   <spec[+]>  
4 cubist          regression    cubist_rules <spec[+]>  
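
Note that rep("linear_reg", 3) above contributes a length-3 character vector inside the list, which create_model_spec() evidently flattens. Spelling the list out element by element should be equivalent:

create_model_spec(
  .parsnip_eng = list("lm", "glm", "glmnet", "cubist"),
  .parsnip_fns = list("linear_reg", "linear_reg", "linear_reg", "cubist_rules")
)
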
create_model_spec(
  .parsnip_eng = list("lm", "glm", "glmnet", "cubist"),
  .parsnip_fns = list(
    rep("linear_reg", 3),
    "cubist_rules"
  ),
  .return_tibble = FALSE
)
$.parsnip_engine
$.parsnip_engine[[1]]
[1] "lm"

$.parsnip_engine[[2]]
[1] "glm"

$.parsnip_engine[[3]]
[1] "glmnet"

$.parsnip_engine[[4]]
[1] "cubist"


$.parsnip_mode
$.parsnip_mode[[1]]
[1] "regression"


$.parsnip_fns
$.parsnip_fns[[1]]
[1] "linear_reg"

$.parsnip_fns[[2]]
[1] "linear_reg"

$.parsnip_fns[[3]]
[1] "linear_reg"

$.parsnip_fns[[4]]
[1] "cubist_rules"


$.model_spec
$.model_spec[[1]]
Linear Regression Model Specification (regression)

Computational engine: lm 


$.model_spec[[2]]
Linear Regression Model Specification (regression)

Computational engine: glm 


$.model_spec[[3]]
Linear Regression Model Specification (regression)

Computational engine: glmnet 


$.model_spec[[4]]
Cubist Model Specification (regression)

Computational engine: cubist 
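
Since each element of $.model_spec prints as a regular {parsnip} specification, it should also be usable on its own; a quick sketch:

mods <- create_model_spec(
  .parsnip_eng = list("lm"),
  .parsnip_fns = list("linear_reg"),
  .return_tibble = FALSE
)

# Fit the extracted spec directly with {parsnip}
fit_lm <- parsnip::fit(mods$.model_spec[[1]], mpg ~ ., data = mtcars)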

Now for the reason we are here. Let's take a look at the first modeling function in {tidyAML}: fast_regression().

library(recipes)
library(dplyr)
library(purrr)

rec_obj <- recipe(mpg ~ ., data = mtcars)
frt_tbl <- fast_regression(
  .data = mtcars, 
  .rec_obj = rec_obj, 
  .parsnip_eng = c("lm","glm"),
  .parsnip_fns = "linear_reg"
)

glimpse(frt_tbl)
Rows: 2
Columns: 8
$ .model_id       <int> 1, 2
$ .parsnip_engine <chr> "lm", "glm"
$ .parsnip_mode   <chr> "regression", "regression"
$ .parsnip_fns    <chr> "linear_reg", "linear_reg"
$ model_spec      <list> [~NULL, ~NULL, NULL, regression, TRUE, NULL, lm, TRUE]…
$ wflw            <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ fitted_wflw     <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ pred_wflw       <list> [<tbl_df[24 x 1]>], [<tbl_df[24 x 1]>]

Now let's take a look at a few different things in frt_tbl.

names(frt_tbl)
[1] ".model_id"       ".parsnip_engine" ".parsnip_mode"   ".parsnip_fns"   
[5] "model_spec"      "wflw"            "fitted_wflw"     "pred_wflw"      

Let’s look at a single model spec.

frt_tbl %>% slice(1) %>% select(model_spec) %>% pull() %>% pluck(1)
Linear Regression Model Specification (regression)

Computational engine: lm 

Now the wflw column.

frt_tbl %>% slice(1) %>% select(wflw) %>% pull() %>% pluck(1)
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 
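
Because wflw holds a standard {workflows} object, the usual extractors should work on it; for instance, pulling the preprocessor back out (a sketch, assuming the {workflows} package is installed):

frt_tbl %>% slice(1) %>% select(wflw) %>% pull() %>% pluck(1) %>%
  workflows::extract_preprocessor()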

The fitted wflw object.

frt_tbl %>% slice(1) %>% select(fitted_wflw) %>% pull() %>% pluck(1)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)          cyl         disp           hp         drat           wt  
   11.77621      0.59296      0.01626     -0.03191     -0.55350     -5.30785  
       qsec           vs           am         gear         carb  
    0.97840      2.64023      1.68549      0.87059      0.58785  
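
The underlying {parsnip} fit can presumably be pulled out of it the same way, using the {workflows} extractor (a sketch):

frt_tbl %>% slice(1) %>% select(fitted_wflw) %>% pull() %>% pluck(1) %>%
  workflows::extract_fit_parsnip()

And {broom} works on the fitted workflow directly:
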
frt_tbl %>% slice(1) %>% select(fitted_wflw) %>% pull() %>% pluck(1) %>%
  broom::glance() %>%
  glimpse()
Rows: 1
Columns: 12
$ r.squared     <dbl> 0.9085669
$ adj.r.squared <dbl> 0.8382337
$ sigma         <dbl> 2.337527
$ statistic     <dbl> 12.91804
$ p.value       <dbl> 3.367361e-05
$ df            <dbl> 10
$ logLik        <dbl> -47.07551
$ AIC           <dbl> 118.151
$ BIC           <dbl> 132.2877
$ deviance      <dbl> 71.03241
$ df.residual   <int> 13
$ nobs          <int> 24
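
Because every row carries its own fitted workflow, the same summary can be mapped over all of the models at once; a sketch with {purrr}, assuming {tidyr} is available for unnesting:

frt_tbl %>%
  mutate(glanced = purrr::map(fitted_wflw, broom::glance)) %>%
  select(.model_id, .parsnip_engine, glanced) %>%
  tidyr::unnest(glanced)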

And finally, the predictions (this one I am probably going to change up).

frt_tbl %>% slice(1) %>% select(pred_wflw) %>% pull() %>% pluck(1)
# A tibble: 24 × 1
   .pred
   <dbl>
 1  17.4
 2  28.4
 3  17.2
 4  10.7
 5  13.4
 6  17.0
 7  22.8
 8  14.3
 9  22.4
10  15.5
# … with 14 more rows
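
The per-model prediction tibbles can likewise be stacked for side-by-side comparison (again a sketch leaning on {tidyr}):

frt_tbl %>%
  select(.model_id, .parsnip_engine, pred_wflw) %>%
  tidyr::unnest(pred_wflw)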

Voila!