Home » vif in Python

vif in Python

Before discussing about the vif, it is essential to understand first what is multicollinearity in linear regression?

The situation of multicollinearity arises when two independent variables have a strong correlation.

Whenever we perform exploratory data analysis, the objective is to obtain a significant parameter that affects our target variable.

Therefore, correlation is the major step that helps us to understand the linear relationship that exists between two variables.

What is Correlation?

Correlation measures the scope to which two variables are interdependent.

A visual idea of checking what kind of a correlation exists between the two variables. We can plot a graph and interpret how does a rise in the value of one attribute affects the other attribute.

Concerning statistics, we can obtain the correlation using Pearson Correlation. It gives us the correlation coefficient and the P-value.

Let us have a look at the criteria-

CORRELATION COEFFICIENT RELATIONSHIP
1. Close to +1 Large Positive
2. Close to -1 Large Negative
3. Close to 0 No relationship exists
P-VALUE CERTAINITY
P-value<0.001 Strong
P-value<0.05 Moderate
P-value<0.1 Weak
P-value>0.1 No

Since now we have a detailed idea of correlation, we now understood that if a strong correlation exists between two independent variables of a dataset, it leads to multicollinearity.

Let’s discuss what kind of issues can occur because of multicollinearity-

  1. Since there is a strong relationship, determining the significant variables would be a difficult task.
  2. The coefficients that we will obtain for the variables can be unstable and as a consequence, the interpretation of a model would be a tedious job.
  3. Overfitting might occur and the accuracy of the model would change with the dataset.

Checking Multicollinearity

The two methods of checking the multicollinearity are-

  • Plotting a heatmap to comprehend the correlation
  • Use Variance Inflation Factor

Plotting a heatmap to comprehend the correlation

Taking a dataset, and plotting a heatmap will help us to infer which attribute has the most significant value of correlation. This value will tell us the extent of influence between the dependent variable and the independent variable.

Let us have a look at a program that shows how it can be implemented.

Example –

Output:

vif in Python

Use Variance Inflation Factor

The Variance Inflation Factor is the measure of multicollinearity that exists in the set of variables that are involved in multiple regressions.

Generally, the vif value above 10 indicates that there is a high correlation with the other independent variables.

Let us have a look at a program that shows how it can be implemented.

Example –

Output:

vif in Python

You may also like