# VIF stepwise variable selection

Author: Paulo van Breugel
Updated on: 21-06-17

## Introduction

### Multicollinearity

When you are including many variables in your analysis or model, some may measure (partially) the same thing. For example, the maximum and minimum mean temperature of the month are highly related to the mean monthly temperature. In such a case your data suffers from multicollinearity. Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.

Multicollinearity causes several problems. The standard errors of the affected coefficients tend to be large. Redundancy among the explanatory variables may lead to model overfitting. Coefficient estimates may change erratically in response to small changes in the model or the data. And if the pattern of multicollinearity in new data differs from that in the data used to fit the model, extrapolation may introduce large errors in the predictions (O’Brien 2007; Dormann et al. 2012). Predictive uncertainty caused by multicollinearity thus poses a challenge for predictive environmental niche or species distribution modeling, especially when used to predict the distribution of a species under novel conditions (e.g., future climates or in new areas).

### Variance inflation factor

One way to detect multicollinearity is variance inflation factor (VIF) analysis (Graham 2003). The VIF is widely used as a measure of the degree of multicollinearity of the ith independent variable with the other independent variables in a regression model. If we have explanatory variables X1, X2, X3, ... Xi, the VIF for an explanatory variable X1 can be calculated by running an ordinary least squares regression that has X1 as a function of all the other explanatory variables X2 ... Xi. The VIF is then computed following Equation 1.

$$\mathrm{VIF} = \frac{1}{1 - R^2} \qquad \text{(Equation 1)}$$

where R² is the coefficient of determination of the regression equation. This can be repeated for each of the explanatory variables. The size of the VIF indicates the magnitude of the multicollinearity. The square root of the VIF shows how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model. Thus, with a VIF of 10 for variable Xi, the standard error for the coefficient of that variable is √10 ≈ 3.2 times larger than it would be if Xi were uncorrelated with the other predictor variables. No formal cutoff value or method exists to determine when a VIF is too large. As a rule of thumb, VIF values in excess of 5 or 10 are often considered an indication that multicollinearity may be a problem (Neter et al. 1989; Menard 1995; Mason et al. 2003).
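As an illustration of the calculation, the following Python sketch computes the VIF for each column of a small synthetic data set with NumPy. It assumes the standard definition VIF = 1 / (1 − R²). The function name `vif` and the variables `x1`, `x2` and `x3` are made up for this example, and the code is not necessarily how r.vif implements the calculation.

```python
import numpy as np


def vif(data):
    """Return the VIF of each column of a 2D array (rows = cells, columns = variables)."""
    data = np.asarray(data, dtype=float)
    n_vars = data.shape[1]
    result = np.empty(n_vars)
    for i in range(n_vars):
        y = data[:, i]
        # Regress variable i on all other variables plus an intercept
        others = np.delete(data, i, axis=1)
        design = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
        # VIF = 1 / (1 - R2); guard against a division by zero for a perfect fit
        result[i] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return result


# Synthetic example: x2 is strongly correlated with x1, x3 is independent
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.3, size=500)
x3 = rng.normal(size=500)

values = vif(np.column_stack([x1, x2, x3]))
for name, v in zip(["x1", "x2", "x3"], values):
    print(f"{name}: vif={v:.2f} sqrtvif={np.sqrt(v):.2f}")
```

In this synthetic case the two correlated variables should get a VIF well above the rule-of-thumb values, while the independent variable should stay close to 1.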

To select a set of variables with sufficiently low multicollinearity, a stepwise selection routine can be used to remove the variables that cause a loss of precision in the parameter estimates, starting with the variable that has the largest VIF (Craney & Surles 2002). This is done by computing the VIF values for the full set of explanatory variables (X1 ... Xi), after which the variable with the highest VIF is removed. Next, the VIF values are computed again for the reduced set of variables. This is repeated until the largest VIF is smaller than a user-defined maximum VIF threshold value.
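The stepwise routine described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the r.vif source code. The helper names `vif` and `stepwise_vif` and the variables `a` to `d` are hypothetical.

```python
import numpy as np


def vif(data):
    """VIF of each column, from an OLS regression on the other columns."""
    out = []
    for i in range(data.shape[1]):
        y = data[:, i]
        design = np.column_stack([np.ones(len(y)), np.delete(data, i, axis=1)])
        resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return np.array(out)


def stepwise_vif(data, names, maxvif):
    """Drop the variable with the largest VIF until all VIFs are below maxvif."""
    names = list(names)
    while len(names) > 1:
        values = vif(data)
        worst = int(np.argmax(values))
        if values[worst] < maxvif:
            break
        # Remove the worst variable and recompute in the next round
        data = np.delete(data, worst, axis=1)
        names.pop(worst)
    return names


# Synthetic data: a, b and c are strongly related, d is independent
rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.2, size=1000)
c = a + b + rng.normal(scale=0.2, size=1000)
d = rng.normal(size=1000)

selected = stepwise_vif(np.column_stack([a, b, c, d]), ["a", "b", "c", "d"], maxvif=10)
print(selected)
```

Which of the three related variables survives depends on their VIF values in each round. The independent variable `d` is never a removal candidate here because its VIF stays close to 1.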

## Computing VIF in GRASS GIS

The r.vif addon for GRASS GIS was created to compute the variance inflation factor for a user-defined set of variables. Like any function in GRASS GIS, the addon can be run from the command line interface (CLI) or using a graphical user interface (GUI; Figure 1). An explanation of the different options is provided in the manual page, which is accessible in the GUI under the tab 'Manual' or can be viewed online. The source code of the addon can be viewed and downloaded from the OSGeo GRASS Trac site.

Figure 1. The graphical user interface (GUI) of the r.vif addon for GRASS GIS.

As a minimum input, the module requires the user to provide the names of the raster layers representing the explanatory variables of a model. With these, the module will compute the variance inflation factor (VIF) for each of the variables. Results are written to the command output (if using the GUI) or the console (if using the CLI). In addition, the user can opt to have the results written to an output text file.

The user can also provide a maximum VIF (maxvif), in which case the addon will run a stepwise procedure as described above. Results of each round will be printed to the console and, if requested, to a user-defined output file. A third option is to retain one or more variables in the stepwise selection. This can be useful if one or more variables are known to be important determinants of the dependent variable. For example, for a species known to be sensitive to low temperatures one may want to include the mean minimum temperature of the coldest month (bio_6). If the user opts to retain this variable, it will be kept at each round of the stepwise procedure. If the variable happens to have the highest VIF, the variable with the next highest VIF will be removed instead.
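The retain rule can be illustrated with a small Python sketch of a single round of the stepwise procedure. The function `variable_to_remove` is a hypothetical helper written for this example, not part of r.vif. The VIF values are those reported for round 12 in Example 2.

```python
def variable_to_remove(vifs, retain=()):
    """Pick the variable to drop in one round of the stepwise procedure.

    vifs maps variable names to VIF values. Variables listed in retain
    are never dropped. Returns None when every variable is retained.
    """
    candidates = {name: value for name, value in vifs.items() if name not in retain}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# VIF values as reported for round 12 in Example 2
round12 = {
    "bio1": 3.12, "bio14": 5.91, "bio18": 3.94, "bio19": 11.78,
    "bio2": 1.30, "bio4": 3.56, "bio8": 5.26, "bio9": 5.06,
}

print(variable_to_remove(round12))                    # bio19
print(variable_to_remove(round12, retain=["bio19"]))  # bio14
```

Without a retain list the variable with the highest VIF (bio19) is dropped. When bio19 is retained, the variable with the next highest VIF (bio14) is the one to go.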

### Notes

To compute the VIF, all data layers are read in as a NumPy array (cells without data are ignored). Memory usage may become problematic for large input data sets. In such cases the user may opt to sample raster values at random locations and use those to compute the VIF. The number of random locations to be generated can be defined either as a positive integer or as a percentage of the raster map layer's cells (see r.random for details).

Using a random subset of raster cells as input means that the VIF values may vary between runs. If the subset is too small, it may even lead to differences in the variables selected when running the stepwise procedure. When running a stepwise procedure, special care should be taken when many of the equations are underdetermined (VIF = inf) in the first rounds.
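The effect of sampling can be illustrated with synthetic data. The sketch below compares VIF values computed from all cells with those from two small random samples. It uses the fact that, for an invertible correlation matrix, the VIF values equal the diagonal of its inverse. The function name `vif_from_correlation` and the synthetic layers are assumptions made for this example and do not reflect how r.vif or r.random work internally.

```python
import numpy as np


def vif_from_correlation(data):
    """VIF of each column as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Three synthetic "layers": the first two share a common signal
rng = np.random.default_rng(7)
n_cells = 100_000
base = rng.normal(size=n_cells)
layers = np.column_stack([
    base + rng.normal(scale=0.5, size=n_cells),
    base + rng.normal(scale=0.5, size=n_cells),
    rng.normal(size=n_cells),
])

full = vif_from_correlation(layers)
print("all cells:", np.round(full, 3))

# Two independent random samples of 200 cells each
samples = []
for run in range(2):
    rows = rng.choice(n_cells, size=200, replace=False)
    samples.append(vif_from_correlation(layers[rows]))
    print(f"sample {run + 1}:", np.round(samples[-1], 3))
```

The two samples give somewhat different VIF values, both from each other and from the values based on all cells. Larger samples reduce this variation.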

As an alternative, the user can set the f flag to invoke the 'low-memory option'. This will use the r.regression.multi function in the background. It uses all raster values and can handle very large raster layers. The disadvantage is that it runs much slower.

## Examples

### Sample data set

The data used in the examples below are the 19 bioclimatic raster layers downloaded from http://worldclim.org. They represent variables that are derived from the monthly temperature and rainfall values from the Worldclim dataset (Hijmans et al. 2005) in order to generate more biologically meaningful variables. For the examples below, we first need to import the bioclim layers in the North Carolina (NC) sample GRASS database, following the steps outlined in the tutorial on how to import and reproject data. In the same post, a description of each of the bioclim variables is given. To run the examples below, start GRASS GIS in the mapset in which you have imported the bioclim data.

### Example 1 - computing the VIF

Because the bioclimatic variables are all derived from the same baseline data, multicollinearity is likely to be a problem. How serious it is can be examined by computing the VIF for each of the 19 bioclimatic variables. Note that in the script below, all bioclimatic variables are first listed and assigned to the variable ‘MAPS’ using the g.list function. This way there is no need to enter all 19 variables as input in the r.vif function (a good example of the convenience of working on the command line rather than using the GUI).

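``````
# List all raster layers whose names start with "bio", separated by commas
MAPS=`g.list type=raster pattern="bio*" sep=,`
# Compute the VIF for each layer and write the results to a CSV file
r.vif maps=$MAPS file=example_1.csv
``````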

The results of the run are written to the console or, if using the GUI, the command output. In addition, the results can be written to a comma delimited file. In this case, this is the file example_1.csv. It contains the same information as printed to screen, but it can be easily imported and used in other programs like LibreOffice, Excel or R.

``````
variable     vif  sqrtvif
bio1     1745.92    41.78
bio10    2920.86    54.04
bio11    3376.83    58.11
bio12     177.43    13.32
bio13      74.42     8.63
bio14      33.45     5.78
bio15     135.52    11.64
bio16      95.78     9.79
bio17     120.81    10.99
bio18      39.61     6.29
bio19      61.00     7.81
bio2       86.84     9.32
bio3       31.08     5.58
bio4      356.75    18.89
bio5         inf      inf
bio6         inf      inf
bio7         inf      inf
bio8        6.93     2.63
bio9        5.70     2.39

Statistics are written to example_1.csv
``````

The results show that multicollinearity is indeed a serious problem. The regression models of bio5, bio6 and bio7 against the other variables even give a (nearly) perfect fit (R² = 1), which results in an undefined VIF (denoted by inf in the table with results). But VIF values are very high for bio1, bio10 and bio11 as well, indicating that these variables can be predicted with a high level of accuracy by using the other variables as predictors.
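Why these three variables end up with an infinite VIF can be shown with a short Python sketch. In the WorldClim definitions, bio7 (temperature annual range) equals bio5 minus bio6, so each of the three is an exact linear combination of the other two. The values below are synthetic, and the tolerance of 1e-12 is an arbitrary choice for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
bio5 = 25 + 5 * rng.normal(size=n)   # synthetic max temperature of warmest month
bio6 = -2 + 4 * rng.normal(size=n)   # synthetic min temperature of coldest month
bio7 = bio5 - bio6                   # annual range, an exact linear combination

# Regress bio7 on bio5 and bio6 (plus an intercept)
design = np.column_stack([np.ones(n), bio5, bio6])
coef, *_ = np.linalg.lstsq(design, bio7, rcond=None)
residuals = bio7 - design @ coef
r2 = 1.0 - residuals @ residuals / np.sum((bio7 - bio7.mean()) ** 2)

# With R2 equal to 1 (up to rounding error) the denominator 1 - R2 is zero
vif = np.inf if 1.0 - r2 < 1e-12 else 1.0 / (1.0 - r2)
print(f"R2 = {r2:.6f}, VIF = {vif}")
```

Removing any one of the three variables breaks the exact dependence, which is why the stepwise procedure in the next example can continue after the first rounds.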

### Example 2 - variable selection

So, the next step is to select a set of variables with low multi-collinearity. This can be done by setting the maxvif parameter. This will tell the function to run a stepwise selection procedure as explained above. In the example below, the maximum VIF is set to 10.

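``````
# List all raster layers whose names start with "bio", separated by commas
MAPS=`g.list type=raster pattern="bio*" sep=,`
# Run the stepwise selection with a maximum VIF of 10
r.vif maps=$MAPS maxvif=10 file=example_2.csv
``````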

An excerpt of the output is given below. It shows that 13 rounds were needed, i.e., 12 variables had to be removed, before the remaining variables all had a VIF < 10. These are the 7 variables listed at the end.

``````
VIF round 12
--------------------------------------
variable      vif  sqrtvif
bio1      3.12     1.77
bio14     5.91     2.43
bio18     3.94     1.98
bio19    11.78     3.43
bio2      1.30     1.14
bio4      3.56     1.89
bio8      5.26     2.29
bio9      5.06     2.25

VIF round 13
--------------------------------------
variable      vif  sqrtvif
bio1      2.82     1.68
bio14     2.89     1.70
bio18     3.79     1.95
bio2      1.26     1.12
bio4      3.45     1.86
bio8      4.67     2.16
bio9      2.50     1.58

selected variables are:
--------------------------------------
bio1, bio14, bio18, bio2, bio4, bio8, bio9

Statistics are written to example_2.csv
``````

### Example 3 - retaining a variable

Now what if you are modeling the potential distribution of a species, and you know from other sources that the species is intolerant of frost and sensitive to extended dry periods? That means that, for example, the mean minimum temperature of the coldest month (bio6) and the precipitation of the driest quarter (bio17) are likely to be important variables. They are, however, not included in the set of selected variables in the previous example. If you check the output file from the previous example, you'll see that bio6 was removed at round 4, while bio17 was removed at round 9 of the stepwise selection procedure. If you want to include specific variables in your analysis, the r.vif function offers the option to retain one or more variables.

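``````
# List all raster layers whose names start with "bio", separated by commas
MAPS=`g.list type=raster pattern="bio*" sep=,`
# Run the stepwise selection, but always keep bio6 and bio17
r.vif maps=$MAPS file=example_3.csv maxvif=10 retain=bio6,bio17
``````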

The output is similar to that in example 2, but this time you can see that bio6 and bio17 are included in the selected variables. The number of variables is the same, but that may not always be the case.

``````
VIF round 13
--------------------------------------
variable   vif   sqrtvif
bio17     3.44      1.86
bio18     4.11      2.03
bio2      1.30      1.14
bio4      4.00      2.00
bio6      3.55      1.88
bio8      4.55      2.13
bio9      2.41      1.55

selected variables are:
-------------------------------------
bio17, bio18, bio2, bio4, bio6, bio8, bio9

Statistics are written to example_3.csv
``````

### Example 4 - using the output in other functions

What if you are less interested in the (potentially long) output, but rather want to use the list with selected variables in another function? Well, you can set the 's' flag, which will tell the function to only print the list with selected variables to screen, making it easier to parse this list in a script, or pipe it to another function.

#### On the command line

In the example below, r.vif prints the list of selected variables, which is captured in the variable SELECT. This is used as input in the i.group function, which is used to create a group of layers.

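``````
# List all raster layers whose names start with "bio", separated by commas
MAPS=`g.list type=raster pattern="bio*" sep=,`
# Capture only the list of selected variables (s flag)
SELECT=`r.vif -s maps=$MAPS maxvif=10`
# Create an image group from the selected layers
i.group group=mygroup input=$SELECT
``````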

Now, we can run i.group with the l flag to list all layers that are part of the newly created image group 'mygroup'. It should come as no surprise to you that these are the layers bio1, bio14, bio18, bio2, bio4, bio8, and bio9.

``````
i.group -l group=mygroup
group <mygroup> references the following raster maps
-------------
<bio1@species> <bio14@species> <bio18@species> <bio2@species>
<bio4@species> <bio8@species> <bio9@species>
-------------
``````

#### In a Python script

You can do the same in a Python script. In the example below, we select a subset of bioclimatic variables and use it to create an image group, which can then serve as input for further analysis.

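``````
# Import modules
import grass.script as gs

# Get list of maps (assumes the bioclim layers have names starting with 'bio')
VARS = gs.read_command("g.list", type="raster", pattern="bio*")
VARS = [v for v in VARS.split("\n") if v]

# Select variables using the r.vif function; the s flag prints only
# the comma-separated list of selected variables
MAPS = gs.read_command("r.vif", flags="s", maps=VARS, maxvif="10")
MAPS = [m for m in MAPS.split("\n") if m][0].split(",")

# Use the variables
gs.run_command("i.group", group="mygroup", input=MAPS)
``````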

## References

• Craney, T.A., & Surles, J.G. 2002. Model-Dependent Variance Inflation Factor Cutoff Values. Quality Engineering 14: 391–403.
• Dormann, C.F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J.R.G., Gruber, B., Lafourcade, B., Leitão, P.J., Münkemüller, T., McClean, C., Osborne, P.E., Reineking, B., Schröder, B., Skidmore, A.K., Zurell, D., & Lautenbach, S. 2012. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography. doi: 10.1111/j.1600-0587.2012.07348.x
• Graham, M.H. 2003. Confronting multicollinearity in ecological multiple regression. Ecology 84: 2809–2815.
• Hijmans, R.J., Cameron, S.E., Parra, J.L., Jones, P.G., & Jarvis, A. 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965–1978.
• Mason, R.L., Gunst, R.F., & Hess, J.L. 2003. Statistical design and analysis of experiments: with applications to engineering and science. John Wiley & Sons.
• Menard, S. 1995. Applied logistic regression analysis: Sage university series on quantitative applications in the social sciences. Thousand Oaks, CA: Sage.
• Neter, J., Wasserman, W., & Kutner, M.H. 1989. Applied linear regression models. Irwin Homewood, IL.
• O’Brien, R.M. 2007. A Caution Regarding Rules of Thumb for Variance Inflation Factors. Quality & Quantity 41: 673–690.