Introduction

The formula() function in R is a generic function that is used to create and manipulate formulas. Formulas are used to specify the relationship between variables in statistical models. The basic syntax for a formula is:

response ~ predictors

The response is the variable that you are trying to predict, and the predictors are the variables that you are using to predict the response. You can use multiple predictors by separating them with + signs. For example, the following formula predicts the mpg (miles per gallon) of a car based on the wt (weight) and hp (horsepower) of the car:

mpg ~ wt + hp

The formula() function can be used to create formulas from scratch, or it can be used to extract formulas from existing objects. For example, the following code creates a formula object called my_formula that predicts the mpg of a car based on the wt and hp of the car:

my_formula <- formula(mpg ~ wt + hp)
my_formula

mpg ~ wt + hp

The formula() function can also be used to manipulate formulas. For example, the following code adds a new predictor called drat (drive ratio) to the my_formula formula:

my_formula <- update(my_formula, mpg ~ wt + hp + drat)
my_formula

mpg ~ wt + hp + drat

The formula() function is a powerful tool that can be used to create, manipulate, and analyze formulas in R.

Here are some additional things to know about the formula() function:

Formulas are objects in R, and they have a number of methods that can be used to manipulate them. For example, you can use the summary() method to get a summary of a formula, or you can use the plot() method to plot a formula.
Formulas can be used with a variety of statistical functions in R. For example, you can use the lm() function to fit a linear model to a formula, or you can use the glm() function to fit a generalized linear model to a formula.
Formulas are a powerful tool for statistical analysis, and they can be used to solve a wide variety of problems. If you are working with data in R, it is important to understand how to use formulas.

Now that we have a decent understanding of the function, I want to shift focus a little bit and show how we can use the generics function formula() in order to extract a formula from a recipe object.

Here is the full code that we are going to look at:

library(recipes)

rec_obj <- recipe(mpg ~ ., data = mtcars)

summary(rec_obj)

# A tibble: 11 × 4
   variable type      role      source  
   <chr>    <list>    <chr>     <chr>   
 1 cyl      <chr [2]> predictor original
 2 disp     <chr [2]> predictor original
 3 hp       <chr [2]> predictor original
 4 drat     <chr [2]> predictor original
 5 wt       <chr [2]> predictor original
 6 qsec     <chr [2]> predictor original
 7 vs       <chr [2]> predictor original
 8 am       <chr [2]> predictor original
 9 gear     <chr [2]> predictor original
10 carb     <chr [2]> predictor original
11 mpg      <chr [2]> outcome   original

# Get formula
rec_obj |> prep() |> formula()

mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
<environment: 0x0000013e3255d2f0>

Let’s break down each line and understand what it does:

library(recipes)

The first line imports the recipes package, which is a powerful tool for preparing and preprocessing data in a structured and reproducible manner.

rec_obj <- recipe(mpg ~ ., data = mtcars)

Here, we create a recipe object named rec_obj. This object represents a set of instructions for data transformation. In this case, we specify the formula mpg ~ ., which means we want to predict the miles per gallon (mpg) using all other variables in the mtcars dataset.

rec_obj |> prep() |> formula()

The next line leverages the magrittr pipe operator (|>) to chain multiple operations. Let’s break it down:

rec_obj is passed to the prep() function. This function performs data preparation steps specified in the recipe object, such as handling missing values, feature scaling, or encoding categorical variables.
The output of prep() is then piped to the formula() function, which extracts the formula representation from the preprocessed recipe object. The resulting formula can be used in subsequent modeling steps.

That’s it! With just a few lines of code, we have defined a recipe, prepared the data accordingly, and obtained the formula representation for further modeling.

Now, let’s dive into a couple more examples to showcase the versatility of the recipes package:

rec_obj <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_predictors())

rec_obj |> prep() |> formula()

Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
<environment: 0x0000013e2e8346f0>

In this example, we create a recipe to predict the species (Species) using all other variables in the iris dataset. We then use the step_normalize() function to standardize all predictor variables in the recipe. This step ensures that variables are on a similar scale, which can be beneficial for certain machine learning algorithms.

rec_obj <- recipe(SalePrice ~ ., data = train_data) |>
  step_dummy(all_nominal(), -all_outcomes())

Here, we define a recipe to predict the sale price (SalePrice) using all other variables in the train_data dataset. The step_dummy() function is used to convert all nominal variables in the recipe into dummy variables. The all_nominal() argument specifies that all variables should be considered, while the -all_outcomes() argument ensures that the outcome variable (SalePrice) is not transformed.

These examples provide a glimpse into the power and flexibility of the recipes package for data preprocessing in R. It enables you to define a clear and reproducible data transformation pipeline that can greatly simplify your machine learning workflows.

Happy coding! 🚀