set.seed(123) # For reproducibility
height <- rnorm(100, mean = 170, sd = 10)
weight <- rnorm(100, mean = 70, sd = 15)
data <- data.frame(height, weight)
Introduction
Today, we’re diving into a fundamental data pre-processing technique: scaling values. This might sound simple, but it can significantly impact how your data behaves in analyses.
Why Scale?
Imagine you have data on customer ages (in years) and purchase amounts (in dollars). The age range might be 18-80, while purchase amounts could vary from $10 to $1000. If you use these values directly in a model, the analysis might be biased towards the purchase amount due to its larger scale. Scaling brings both features (age and purchase amount) to a common ground, ensuring neither overpowers the other.
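To make the age and purchase example concrete, here is a minimal sketch using Euclidean distance, one common place where scale matters. The three customers and their values are made up for illustration; they are not taken from any real data set.

```r
# Hypothetical customers: age in years, purchase amount in dollars
customers <- data.frame(
  age      = c(20, 70, 21),
  purchase = c(500, 510, 900)
)
rownames(customers) <- c("A", "B", "C")

# Distances on the raw values: the dollar column dominates,
# so A looks far closer to B (50-year age gap) than to C ($400 gap)
raw_dist <- as.matrix(dist(customers))
raw_dist["A", "B"]
raw_dist["A", "C"]

# Distances after scaling: both columns are in standard-deviation units,
# so the age gap and the purchase gap carry comparable weight
scaled_dist <- as.matrix(dist(scale(customers)))
scaled_dist["A", "B"]
scaled_dist["A", "C"]
```

On the raw values, the distance from A to C is several times the distance from A to B purely because dollars span a wider range than years. After scaling, the two distances are of similar size.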
The scale() Function
R offers a handy function called scale() to achieve this. Here’s the basic syntax:
scaled_data <- scale(x, center = TRUE, scale = TRUE)
x: This is the vector or data frame containing the values you want to scale (a numeric matrix-like object).
center: Either a logical value or numeric-alike vector of length equal to the number of columns of x, where ‘numeric-alike’ means that as.numeric(.) will be applied successfully if is.numeric(.) is not true. With center = TRUE, the column means are subtracted.
scale: Either a logical value or numeric-alike vector of length equal to the number of columns of x. With scale = TRUE and center = TRUE, each centered column is divided by its standard deviation.
scaled_data: This stores the result, a numeric matrix of scaled values. With the defaults, each column has a mean of 0 and a standard deviation of 1.
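As a quick check of what the default arguments do, the sketch below compares scale() against a z-score computed by hand. The matrix x and the name manual are illustrative only.

```r
# Small numeric matrix with two columns on different scales (made-up values)
x <- cbind(a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 50))

# Default behaviour: subtract column means, then divide by column sds
scaled <- scale(x, center = TRUE, scale = TRUE)

# The same z-score computed by hand with sweep()
centered <- sweep(x, 2, colMeans(x), "-")
manual   <- sweep(centered, 2, apply(x, 2, sd), "/")

# The two results agree (attributes aside)
all.equal(as.vector(scaled), as.vector(manual))

# scale() records the values it used as attributes on the result
attr(scaled, "scaled:center")  # column means
attr(scaled, "scaled:scale")   # column standard deviations
```

The stored attributes are handy when you later need to apply the same centering and scaling to new data, or to transform values back to the original units.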
Example in Action!
Let’s see scale() in action. We’ll work with sample data for height (in cm) and weight (in kg) of individuals, generated by the code at the top of this post.
This creates a data frame (data) with 100 rows, where height has values around 170 cm with a standard deviation of 10 cm, and weight is centered around 70 kg with a standard deviation of 15 kg.
Visualizing Before and After
Now, let’s visualize the distribution of both features before and after scaling. We’ll use the ggplot2 package for this:
library(ggplot2)
library(dplyr)
library(tidyr)
# Make Scaled data and cbind to original
scaled_data <- scale(data)
data <- setNames(
  cbind(data, scaled_data),
  c("height", "weight", "height_scaled", "weight_scaled")
)
# Tidy data for facet plotting
data_long <- pivot_longer(
  data,
  cols = c(height, weight, height_scaled, weight_scaled),
  names_to = "variable",
  values_to = "value"
)
# Visualize
data_long |>
  ggplot(aes(x = value, fill = variable)) +
  geom_histogram(
    bins = 30,
    alpha = 0.328) +
  facet_wrap(~variable, scales = "free") +
  labs(
    title = "Distribution of Height and Weight Before and After Scaling"
  ) +
  theme_minimal()
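Run this code and see the magic! Because the facets use free axes, pay attention to the x-axis of each panel. Before scaling, height is centered near 170 and weight near 70, with different spreads. After scaling, both are centered around 0 with a standard deviation of 1. Scaling shifts and rescales the values; it does not change the shape of each distribution.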
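If you would rather confirm this numerically than by eye, a minimal sketch like the following checks the column means and standard deviations. It rebuilds the sample data so it can run on its own.

```r
# Recreate the sample data so this snippet is self-contained
set.seed(123)
data <- data.frame(
  height = rnorm(100, mean = 170, sd = 10),
  weight = rnorm(100, mean = 70, sd = 15)
)

scaled_data <- scale(data)

# Before scaling: means and sds differ between the two columns
colMeans(data)
apply(data, 2, sd)

# After scaling: means are 0 and sds are 1, up to floating-point error
round(colMeans(scaled_data), 10)
apply(scaled_data, 2, sd)
```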
Try it Yourself!
This is just a basic example. Get your hands dirty! Try scaling data from your own projects and see how it affects your analysis. Remember, scaling is just one step in data pre-processing. Explore other techniques like centering or normalization depending on your specific needs.
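As a starting point for those other techniques, here is a small sketch of centering only and of min-max normalization. The ages are made-up values, and the min-max formula is written out by hand as an assumption about what "normalization" means here; it is not something scale() does by default.

```r
# Hypothetical ages in years
x <- c(18, 25, 40, 62, 80)

# Centering only: subtract the mean, keep the original units
centered <- scale(x, center = TRUE, scale = FALSE)

# Min-max normalization: map the smallest value to 0 and the largest to 1
min_max <- (x - min(x)) / (max(x) - min(x))

mean(centered)   # effectively 0
range(min_max)   # spans 0 to 1
```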
So, the next time you have features with different scales, consider using scale() to bring them to a level playing field and unlock the full potential of your models!