Multicollinearity in regression analysis occurs when two or more explanatory variables are highly correlated with each other, such that they do not provide unique or independent information in the regression model. If the correlation between variables is high enough, it can cause problems when fitting and interpreting the regression model.
For example, suppose you run a multiple linear regression with the following variables:
Response variable: max vertical jump
Explanatory variables: shoe size, height, time spent practicing
In this case, the explanatory variables shoe size and height are likely to be highly correlated since taller people tend to have larger shoe sizes. This means that multicollinearity is likely to be a problem in this regression.
Fortunately, it’s possible to detect multicollinearity using a metric known as the variance inflation factor (VIF), which measures how strongly each explanatory variable in a regression model is correlated with the other explanatory variables.
This tutorial explains how to use VIF to detect multicollinearity in a regression analysis in Stata.
Example: Multicollinearity in Stata
For this example we will use the Stata built-in dataset called auto. Use the following command to load the dataset:

sysuse auto
We’ll use the regress command to fit a multiple linear regression model using price as the response variable and weight, length, and mpg as the explanatory variables:
regress price weight length mpg
Next, we’ll use the vif command to test for multicollinearity:

vif
This produces a VIF value for each of the explanatory variables in the model. The value for VIF starts at 1 and has no upper limit. A general rule of thumb for interpreting VIFs is as follows:
- A value of 1 indicates there is no correlation between a given explanatory variable and any other explanatory variables in the model.
- A value between 1 and 5 indicates moderate correlation between a given explanatory variable and other explanatory variables in the model, but this is often not severe enough to require attention.
- A value greater than 5 indicates potentially severe correlation between a given explanatory variable and other explanatory variables in the model. In this case, the coefficient estimates and p-values in the regression output are likely unreliable.
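Under the hood, the VIF for each explanatory variable is computed as VIF = 1 / (1 − R²), where R² comes from regressing that variable on all the other explanatory variables. For readers curious about that arithmetic, here is a minimal sketch in Python with NumPy, using synthetic data rather than the auto dataset:

```python
# Minimal sketch of the VIF computation (synthetic data, not Stata's auto dataset).
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing the j-th
# explanatory variable on all the others (with an intercept).
import numpy as np

def vif(X):
    """Return one VIF per column of the design matrix X (n rows, k columns)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]                                # treat column j as the response
        others = np.delete(X, j, axis=1)           # remaining explanatory variables
        A = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()             # R^2 of the auxiliary regression
        vifs.append(1 / (1 - r2))
    return vifs

# Two nearly collinear columns inflate each other's VIFs well past 5,
# while an unrelated third column stays near 1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # highly correlated with x1
x3 = rng.normal(size=200)                  # independent of the others
v = vif(np.column_stack([x1, x2, x3]))
```

This mirrors the rule of thumb above: the two redundant columns get large VIFs, and the independent column's VIF sits near the floor of 1.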
We can see that the VIF values for both weight and length are greater than 5, which indicates that multicollinearity is likely a problem in the regression model.
How to Deal with Multicollinearity
Often the easiest way to deal with multicollinearity is to simply remove one of the problematic variables, since the variable you’re removing is likely redundant anyway and adds little unique or independent information to the model.
To determine which variable to remove, we can use the corr command to create a correlation matrix showing the pairwise correlation coefficients between the variables in the model. This helps us identify which explanatory variables are highly correlated with each other and may be causing the multicollinearity:
corr price weight length mpg
We can see that length is highly correlated with both weight and mpg, and it has the lowest correlation with the response variable price. Thus, removing length from the model could solve the problem of multicollinearity without reducing the overall quality of the regression model.
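The same correlation-matrix check can be sketched outside Stata. The block below builds synthetic stand-ins for price, weight, length, and mpg (illustrative data, not the actual auto dataset) and inspects their pairwise correlations with NumPy:

```python
# Sketch of the correlation-matrix check with synthetic data. The variable
# names are stand-ins for the auto dataset's columns, not the real values.
import numpy as np

rng = np.random.default_rng(2)
weight = rng.normal(size=250)
length = weight + rng.normal(scale=0.2, size=250)  # strongly tied to weight
mpg = -0.5 * weight + rng.normal(size=250)         # negatively tied to weight
price = 2.0 * weight + rng.normal(size=250)

# Rows/columns follow the order: price, weight, length, mpg.
# np.corrcoef treats each row of the stacked array as one variable.
R = np.corrcoef(np.vstack([price, weight, length, mpg]))
```

Scanning the off-diagonal entries of R is the same exercise as reading Stata's corr output: the large weight–length entry flags the redundant pair, while the negative length–mpg entry shows length is also entangled with mpg.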
To test this, we can perform the regression analysis again using just weight and mpg as explanatory variables:
regress price weight mpg
We can see that the adjusted R-squared of this model is 0.2735, compared to 0.3298 in the previous model. This indicates that the overall usefulness of the model decreased only slightly. Next, we can find the VIF values again using the vif command:

vif
Both VIF values are below 5, which indicates that multicollinearity is no longer a problem in the model.
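As a final illustration (again in Python with synthetic data, not the auto dataset), dropping one of two nearly collinear columns brings the surviving variable's VIF back toward 1, which is exactly the behavior we observed after removing length:

```python
# Illustrative check with synthetic data: dropping one of two nearly
# collinear columns deflates the remaining variable's VIF.
import numpy as np

def vif_of_first_column(X):
    """VIF of column 0 of X: 1 / (1 - R^2) from regressing it on the rest."""
    y, others = X[:, 0], X[:, 1:]
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other columns
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - (y - A @ coef).var() / y.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # near-duplicate of x1
x3 = rng.normal(size=300)                  # independent

before = vif_of_first_column(np.column_stack([x1, x2, x3]))  # x2 present: large VIF
after = vif_of_first_column(np.column_stack([x1, x3]))       # x2 dropped: near 1
```

The drop from a severely inflated VIF to one near 1 is the synthetic analogue of what removing length did for weight in the Stata model.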