Multi-variate regression works in exactly the same way that simple regression does, except that it uses more than one explanatory variable. This also means the model no longer lives in two-dimensional space and is harder to picture mentally, although all the same ideas apply.
The basic equation for multi-variate regression is
\[y=\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_m x_m+e\]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope for \(x_1\), \(\beta_2\) is the slope for \(x_2\), and so on; \(e\) is still the error term. This regression model is applied to \(n\) observations, each consisting of one value of the response variable and one value of each of the \(m\) explanatory variables. Ideally, pick explanatory variables that you can justify on practical or theoretical grounds. You could also choose the variables that generate the largest value of Adjusted \(R^2\), or those with the most significant p-values. Remember, when judging the goodness-of-fit of a multi-variate regression model, you must use Adjusted \(R^2\)!
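For reference, Adjusted \(R^2\) adjusts the ordinary \(R^2\) for the number of explanatory variables \(m\) relative to the number of observations \(n\):
\[R^2_{adj}=1-\left(1-R^2\right)\frac{n-1}{n-m-1}\]
Adding another variable always increases \(R^2\), but it only increases Adjusted \(R^2\) if the improvement in fit outweighs the penalty for the extra term.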
Let’s try to conduct a multi-variate regression on the cars data. First, make sure that you’ve selected your dataAnalysis folder as your working directory. Now, read the cars data into R:
cars<-read.csv("cars.csv",as.is=TRUE) ##Remember, as.is=TRUE keeps text data as character data instead of converting it to factors
Here’s a reminder of what each of the fields in this data set represents:
Name | Description |
---|---|
mpg | Miles/(US) gallon |
cyl | Number of cylinders |
disp | Displacement (cu.in.) |
hp | Gross horsepower |
drat | Rear axle ratio |
wt | Weight (1,000 lbs) |
qsec | 1/4 mile time |
vs | Engine shape (0 = V-shaped, 1 = straight) |
am | Transmission (0 = automatic, 1 = manual) |
gear | Number of forward gears |
carb | Number of carburetors |
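Before fitting anything, it’s worth a quick check that the file loaded the way you expect. A minimal sanity check, assuming cars.csv contains the columns listed above:
str(cars) ##Shows the type of each column and its first few values
head(cars) ##Prints the first six rows of the data frame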
Now let’s set up our first multi-variate regression, choosing variables that we think could impact the mileage of the car: number of cylinders, displacement, horsepower, weight, number of gears, and number of carburetors. Notice that we add explanatory variables to the model with the \(+\) symbol.
cars.lm<-lm(mpg~cyl+disp+hp+wt+gear+carb,data=cars) ##Remember that lm stands for linear model
Now let’s look at a summary of this data:
summary(cars.lm)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + gear + carb, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.363 -1.609 -0.451 1.521 5.875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.47662 8.24058 4.18 0.00031 ***
## cyl -0.93081 0.81074 -1.15 0.26180
## disp 0.00881 0.01580 0.56 0.58233
## hp -0.02203 0.02029 -1.09 0.28781
## wt -3.23829 1.40689 -2.30 0.02996 *
## gear 1.09950 1.32828 0.83 0.41564
## carb -0.37782 0.73130 -0.52 0.60994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.58 on 25 degrees of freedom
## Multiple R-squared: 0.853, Adjusted R-squared: 0.817
## F-statistic: 24.1 on 6 and 25 DF, p-value: 2.93e-09
Note that this output looks exactly like that of simple regression, except that we have more entries in the “Coefficients” table.
Let’s evaluate this model now. We start by looking at the Adjusted \(R^2\) and see that, at 0.8173, it appears to have a strong fit. It is definitely stronger than the \(0.75\) that we achieved using just weight in Lesson 3. We also notice that the p-value for the F-statistic is very small, indicating that we have a good model. Now we take a look at the p-values in the “Coefficients” table. This looks a little worrisome. We would like to see at least some of the p-values under \(0.05\). Only weight is below \(0.05\), and some of the others are very high, with number of carburetors coming in at \(0.6\). Note that these p-values measure the significance of each variable in the presence of all of the other explanatory variables. This means we can’t drop all of the variables with high p-values at once, because the p-values can change substantially when even a single variable is removed. We have to drop variables with high p-values one at a time. Let’s start by dropping carburetors and see what our model looks like then:
cars.lm2<-lm(mpg~cyl+disp+hp+wt+gear,data=cars) ##Notice that we dropped carb
summary(cars.lm2)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + gear, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.223 -1.686 -0.383 1.293 5.943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.3626 5.9725 6.26 1.3e-06 ***
## cyl -1.1186 0.7144 -1.57 0.1295
## disp 0.0138 0.0123 1.12 0.2727
## hp -0.0279 0.0166 -1.68 0.1052
## wt -3.7143 1.0482 -3.54 0.0015 **
## gear 0.6788 1.0345 0.66 0.5175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.54 on 26 degrees of freedom
## Multiple R-squared: 0.851, Adjusted R-squared: 0.822
## F-statistic: 29.7 on 5 and 26 DF, p-value: 5.72e-10
Notice that our Adjusted \(R^2\) went up, but that all of our p-values changed. This is the reason you have to drop variables one at a time. You can’t anticipate all of the relationships between the variables in the model, so you can’t predict exactly what will happen as you start dropping variables. Let’s try dropping one more variable. The highest p-value in the new model belongs to gear, so we’ll drop it in our next iteration.
cars.lm3<-lm(mpg~cyl+disp+hp+wt,data=cars) ##Notice that we dropped gear
summary(cars.lm3)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.056 -1.464 -0.428 1.285 5.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.8285 2.7575 14.81 1.8e-14 ***
## cyl -1.2933 0.6559 -1.97 0.05895 .
## disp 0.0116 0.0117 0.99 0.33139
## hp -0.0205 0.0121 -1.69 0.10238
## wt -3.8539 1.0155 -3.80 0.00076 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51 on 27 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.826
## F-statistic: 37.8 on 4 and 27 DF, p-value: 1.06e-10
In this new model, we see marginal improvement in the Adjusted \(R^2\), and once again the p-values of our remaining coefficients change. You can see that this iterative methodology could be time-consuming. It is what you would have to do if you were working in Excel (and Excel makes it hard to drop variables, since to do so you have to completely delete or move columns of your data). Luckily, R has a function that will add and subtract variables for you and determine statistically what the best-fitting model should be. The function we will use to do this is stepAIC. This function uses the Akaike information criterion (AIC) to measure the relative quality of a statistical model; AIC balances the trade-off between the goodness of fit of a model and its complexity. The stepAIC function is found in the MASS package in R. The MASS package ships with R, so you don’t have to install it. It does not automatically load when you start R, however, so you need to call library(MASS) before using the stepAIC function.
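For reference, for a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\), the criterion is
\[AIC=2k-2\ln(\hat{L})\]
Lower AIC is better: the \(-2\ln(\hat{L})\) term rewards goodness of fit, while the \(2k\) term penalizes extra complexity.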
Below I show you how to use the stepAIC function. Notice that I’m using our original multi-variate model, cars.lm, because I want to give stepAIC a model with all of the candidate variables and let it drop variables until it arrives at the best model. The output of the function is the “winning” model, which I will assign to a variable named winner.
library(MASS)
winner<-stepAIC(cars.lm) ##Calculates my winning model
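By default, stepAIC prints each model it considers as it works. If you want a compact record of the path it took, the returned model keeps it in an anova component (and passing trace=FALSE to stepAIC suppresses the step-by-step printout):
winner$anova ##Shows which variable was dropped at each step and the AIC after each drop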
Now let’s look at the summary of this model:
summary(winner)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.929 -1.560 -0.531 1.185 5.899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7518 1.7869 21.69 <2e-16 ***
## cyl -0.9416 0.5509 -1.71 0.0985 .
## hp -0.0180 0.0119 -1.52 0.1400
## wt -3.1670 0.7406 -4.28 0.0002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51 on 28 degrees of freedom
## Multiple R-squared: 0.843, Adjusted R-squared: 0.826
## F-statistic: 50.2 on 3 and 28 DF, p-value: 2.18e-11
This gives us the statistically best model for this data. It has an adjusted \(R^2\) of 0.8263, marginally better than cars.lm3. The explanatory variables left in the final model are cylinders, horsepower, and weight.
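Rounding the coefficients from the summary above, the winning model can be written as
\[\widehat{mpg}=38.75-0.94\,cyl-0.018\,hp-3.17\,wt\]
If you want to use the model to predict the mileage of a new car, you can pass values for these explanatory variables to the predict function. The car below is purely hypothetical (6 cylinders, 150 horsepower, 3,000 lbs), just to show the mechanics:
new.car<-data.frame(cyl=6,hp=150,wt=3.0) ##A hypothetical car; remember wt is measured in 1,000s of lbs
predict(winner,newdata=new.car) ##Predicted miles per gallon for this car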
Just like last lesson, we once again need to check our four regression assumptions: linearity, normality of the residuals, equal variance (no heteroscedasticity), and independence.
Let’s create our plots to check these assumptions:
plot(winner)
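By default, R steps you through the diagnostic plots one at a time, prompting you to hit Return between them. If you’d rather see all four plots in a single window, one option is to set up a 2-by-2 plotting grid first:
par(mfrow=c(2,2)) ##Arrange the plotting window as a 2x2 grid
plot(winner) ##Draw the four diagnostic plots together
par(mfrow=c(1,1)) ##Reset the plotting window to a single plot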
From looking at our plots, we see that linearity and normality are close to being out of tolerance, but should be acceptable. We don’t seem to have any heteroscedasticity (unequal variance). Looking at the Cook’s distance, we don’t see any outliers that significantly affect our model. Finally, while our explanatory variables are somewhat related, we can assume we have enough independence to use our multi-variate regression model.
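If you want a rough numerical check on how related the remaining explanatory variables are, one quick sketch is to look at their pairwise correlations. Values close to 1 or -1 would be a warning sign for multicollinearity:
cor(cars[,c("cyl","hp","wt")]) ##Pairwise correlations between the explanatory variables in the winning model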