A **Box-Cox transformation** is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.

The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:

- y(λ) = (y^{λ} - 1) / λ if λ ≠ 0
- y(λ) = log(y) if λ = 0
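As a minimal sketch, the formula can be written as a small R function. The name `box_cox` is a hypothetical helper used only for illustration (it is not part of MASS), and the sketch assumes every value of y is strictly positive. The log branch applies when λ = 0.

```r
# Hypothetical helper (not part of MASS): Box-Cox transform of a positive vector
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))          # the transformation is only defined for y > 0
  if (lambda == 0) {
    log(y)                       # limiting case as lambda approaches 0
  } else {
    (y^lambda - 1) / lambda
  }
}

y <- c(1, 2, 4)
box_cox(y, 0)    # same as log(y)
box_cox(y, 1)    # y - 1: shifts the data but leaves its shape unchanged
box_cox(y, 0.5)  # 2 * (sqrt(y) - 1)
```

Note that λ = 1 leaves the shape of the data unchanged, so an optimal λ close to 1 suggests that no transformation is needed.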

We can perform a Box-Cox transformation in R by using the **boxcox()** function from the **MASS** package. The following example shows how to use this function in practice.

*Refer to this paper from the University of Connecticut for a nice summary of the development of the Box-Cox transformation.*

**Example: Box-Cox Transformation in R**

The following code shows how to fit a linear regression model to a dataset, then use the **boxcox()** function to find an optimal lambda to transform the response variable and fit a new model.

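    library(MASS)

    #create data
    y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
    x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)

    #fit linear regression model
    model <- lm(y ~ x)

    #find optimal lambda for Box-Cox transformation
    #(the lambda on the search grid with the highest log-likelihood)
    bc <- boxcox(y ~ x)
    lambda <- bc$x[which.max(bc$y)]
    lambda

    #fit new linear regression model using the Box-Cox transformation
    new_model <- lm(((y^lambda - 1) / lambda) ~ x)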

The optimal lambda was found to be **-0.4242424**. Thus, the new regression model replaced the original response variable y with the transformed variable (y^{-0.4242424} - 1) / -0.4242424.
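Because the new model is fit to the transformed response, its fitted values are on the transformed scale as well. The following is a sketch of how one might map them back to the original units by inverting the formula. The helper `inv_box_cox` is hypothetical (not part of MASS), it only handles λ ≠ 0, and lambda is hard-coded to the value reported above so that the block runs on its own.

```r
# Data and lambda taken from the example above
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)
lambda <- -0.4242424

# Transformed response and the model fit on the transformed scale
y_bc <- (y^lambda - 1) / lambda
new_model <- lm(y_bc ~ x)

# Hypothetical helper: inverse of the Box-Cox formula for lambda != 0
inv_box_cox <- function(z, lambda) (lambda * z + 1)^(1 / lambda)

# Fitted values mapped back to the original units of y
fitted_original <- inv_box_cox(fitted(new_model), lambda)

# Sanity check: applying the inverse to the transformed response recovers y
all.equal(inv_box_cox(y_bc, lambda), y)
```

Back-transformed fitted values are not the same as conditional means on the original scale, so they are best read as rough, median-like predictions.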

The following code shows how to create two Q-Q plots in R to visualize the differences in residuals between the two regression models:

    #define plotting area (two plots side by side)
    op <- par(mfrow = c(1, 2))

    #Q-Q plot for original model
    qqnorm(model$residuals)
    qqline(model$residuals)

    #Q-Q plot for Box-Cox transformed model
    qqnorm(new_model$residuals)
    qqline(new_model$residuals)

    #restore original plotting settings
    par(op)

As a rule of thumb, if the data points fall along a straight diagonal line in a Q-Q plot then the dataset likely follows a normal distribution.

Notice how the Box-Cox transformed model produces a Q-Q plot whose points follow the reference line much more closely than those of the original regression model.

This indicates that the residuals of the Box-Cox transformed model are closer to normally distributed, which better satisfies the linear regression assumption of normally distributed residuals.
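The visual impression from a Q-Q plot can be complemented with a formal check. The sketch below applies base R's `shapiro.test()` to the residuals of both models; a small p-value is evidence against normality. The data and lambda are repeated from the example above so that the block runs on its own, and with only 15 observations the test has limited power, so the result should be read alongside the plots rather than in place of them.

```r
# Data and lambda repeated from the example above
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)
lambda <- -0.4242424

# Original model and Box-Cox transformed model
model <- lm(y ~ x)
new_model <- lm(((y^lambda - 1) / lambda) ~ x)

# Shapiro-Wilk test on the residuals of each model
sw_original <- shapiro.test(residuals(model))
sw_boxcox <- shapiro.test(residuals(new_model))

# Compare p-values side by side
c(original = sw_original$p.value, box_cox = sw_boxcox$p.value)
```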
