- Regression models are one of the most useful statistical tools in data analysis.
- This R Markdown document shows some simple regression models and how they can be used inside the R environment.
- The regression models discussed in this document are:
- Simple Linear Regression
- Multiple Linear Regression
- Multivariate Linear Regression
- ANOVA
- Multiple Comparison Testing
- ANCOVA
- MANOVA
- MANCOVA
- Logistic Regression
- Poisson Regression
Load environment
rm(list=ls(all=TRUE))
# options
options(stringsAsFactors = FALSE)
Simple Linear Regression
- We can use simple linear regression to fit a model that predicts Petal Length from Sepal Length.
- This can simply be done using the lm() function.
# let's load the iris data set
data("iris")
# Does Longer Sepal length mean longer Petal Length too?
linear.fit = lm(
formula = Petal.Length ~ Sepal.Length,
data = iris
)
# show results
summary(linear.fit)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47747 -0.59072 -0.00668 0.60484 2.49512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.10144 0.50666 -14.02 <2e-16 ***
## Sepal.Length 1.85843 0.08586 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
- The summary of the model shows that Sepal Length is a significant variable.
- The R-squared is 0.76, which shows that a large portion of the variance in the dependent variable is explained by the independent variable.
- The F-Statistic shows that this model is significantly better than the null model.
- The coefficient of Sepal.Length is 1.86.
  + Because this coefficient is positive, Sepal.Length and Petal.Length are positively correlated (i.e. a larger Sepal.Length corresponds to a longer Petal.Length).
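The fitted model can also be used for prediction with predict(). A minimal sketch (the new Sepal.Length values below are made up for illustration):

```r
# refit the simple model, then predict Petal.Length for unseen flowers
data("iris")
linear.fit = lm(Petal.Length ~ Sepal.Length, data = iris)

# hypothetical new flowers with Sepal.Length of 5, 6 and 7 cm
new.flowers = data.frame(Sepal.Length = c(5, 6, 7))
predict(linear.fit, newdata = new.flowers)
```

Because the slope is positive, the predicted Petal.Length increases with each step in Sepal.Length.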
Multiple Linear Regression
- We know that Sepal Length might not be the only variable related to Petal Length.
- We can use more than one explanatory/independent variable to explain our dependent variable.
- This can also be done in R using the lm() function.
# Can we explain/predict the petal length using all the other variables in the iris data set?
multiple.linear.fit = lm(
formula = Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species,
data = iris
)
# show results
summary(multiple.linear.fit)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width +
## Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78396 -0.15708 0.00193 0.14730 0.65418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.11099 0.26987 -4.117 6.45e-05 ***
## Sepal.Length 0.60801 0.05024 12.101 < 2e-16 ***
## Sepal.Width -0.18052 0.08036 -2.246 0.0262 *
## Petal.Width 0.60222 0.12144 4.959 1.97e-06 ***
## Speciesversicolor 1.46337 0.17345 8.437 3.14e-14 ***
## Speciesvirginica 1.97422 0.24480 8.065 2.60e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2627 on 144 degrees of freedom
## Multiple R-squared: 0.9786, Adjusted R-squared: 0.9778
## F-statistic: 1317 on 5 and 144 DF, p-value: < 2.2e-16
- The summary of the model shows that all independent variables are significant.
- The R-squared and the Adjusted R-Squared are both very good (0.978, 0.978), showing that the dependent variable is explained very well by the independent variables (better than our previous model).
- The F-Statistic shows that this model is significantly better than the null model.
- We can look at the coefficients and draw several conclusions:
  + A larger Sepal.Width usually corresponds to a shorter Petal.Length.
  + The other variables are positively correlated with Petal.Length, so larger values correspond to a longer Petal.Length.
  + The virginica species seems to have the longest Petal.Length (largest coefficient).
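Since the simple model (Sepal.Length only) is nested inside this multiple model, the two can be compared directly with a partial F-test using anova(); a quick sketch:

```r
data("iris")
# the simple model is nested inside the multiple model
simple.fit   = lm(Petal.Length ~ Sepal.Length, data = iris)
multiple.fit = lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species,
                  data = iris)

# partial F-test: do the extra terms significantly reduce the residual error?
anova(simple.fit, multiple.fit)
```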
Model Selection
- Sometimes data sets have many possible explanatory variables, and it is not clear which set of independent variables should be used to create a significant, well-trained, generalizable model.
- Expert knowledge should be used to select the best set of variables.
- Expert knowledge can be combined with agnostic model selection approaches.
- Model selection approaches can also be used in settings where expert knowledge is lacking or not available.
- We can perform stepwise model selection in R using the MASS package.
# load the MASS package, which provides stepAIC()
library(MASS)
step.fit = stepAIC(
object = multiple.linear.fit,
direction = 'both' # "forward" and "backward" are also possible
)
## Start: AIC=-395.12
## Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species
##
## Df Sum of Sq RSS AIC
## <none> 9.9397 -395.12
## - Sepal.Width 1 0.3484 10.2880 -391.95
## - Petal.Width 1 1.6974 11.6371 -373.47
## - Species 2 4.9133 14.8529 -338.87
## - Sepal.Length 1 10.1075 20.0472 -291.88
summary(step.fit)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width +
## Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78396 -0.15708 0.00193 0.14730 0.65418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.11099 0.26987 -4.117 6.45e-05 ***
## Sepal.Length 0.60801 0.05024 12.101 < 2e-16 ***
## Sepal.Width -0.18052 0.08036 -2.246 0.0262 *
## Petal.Width 0.60222 0.12144 4.959 1.97e-06 ***
## Speciesversicolor 1.46337 0.17345 8.437 3.14e-14 ***
## Speciesvirginica 1.97422 0.24480 8.065 2.60e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2627 on 144 degrees of freedom
## Multiple R-squared: 0.9786, Adjusted R-squared: 0.9778
## F-statistic: 1317 on 5 and 144 DF, p-value: < 2.2e-16
- Stepwise model selection determined that the best model is the one that includes all variables in the iris data set.
- This makes sense since the iris data set does not have many variables.
- This method becomes extremely useful when dealing with data sets that have dozens, hundreds or even thousands of variables.
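The same ranking can be checked directly with AIC(); the values differ from those printed by stepAIC() by an additive constant, but the model comparison agrees. A sketch:

```r
data("iris")
simple.fit   = lm(Petal.Length ~ Sepal.Length, data = iris)
multiple.fit = lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species,
                  data = iris)

# lower AIC indicates a better trade-off between fit and model complexity
AIC(simple.fit, multiple.fit)
```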
Multivariate Linear Regression
- Sometimes we want to model more than one dependent variable.
- This can be done using Multivariate Linear Models.
- This can also be easily accomplished using the lm() function in R.
# Predict Petal Length and Petal Width using all other variables in the iris data set.
multivariate.fit = lm(
data = iris,
formula = cbind(Petal.Length, Petal.Width) ~ Sepal.Length + Sepal.Width + Species
)
summary(multivariate.fit)
## Response Petal.Length :
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75196 -0.18755 0.00432 0.16965 0.79580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.63430 0.26783 -6.102 9.08e-09 ***
## Sepal.Length 0.64631 0.05353 12.073 < 2e-16 ***
## Sepal.Width -0.04058 0.08113 -0.500 0.618
## Speciesversicolor 2.17023 0.10657 20.364 < 2e-16 ***
## Speciesvirginica 3.04911 0.12267 24.857 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2833 on 145 degrees of freedom
## Multiple R-squared: 0.9749, Adjusted R-squared: 0.9742
## F-statistic: 1410 on 4 and 145 DF, p-value: < 2.2e-16
##
##
## Response Petal.Width :
##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50805 -0.10042 -0.01221 0.11416 0.46455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.86897 0.16985 -5.116 9.73e-07 ***
## Sepal.Length 0.06360 0.03395 1.873 0.063 .
## Sepal.Width 0.23237 0.05145 4.516 1.29e-05 ***
## Speciesversicolor 1.17375 0.06758 17.367 < 2e-16 ***
## Speciesvirginica 1.78487 0.07779 22.944 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1797 on 145 degrees of freedom
## Multiple R-squared: 0.9459, Adjusted R-squared: 0.9444
## F-statistic: 634.3 on 4 and 145 DF, p-value: < 2.2e-16
- This output should be very familiar to you at this point.
- Interestingly, you can see that Sepal.Length was not too useful in predicting Petal.Width, and Sepal.Width was not useful in modeling Petal.Length.
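For a multivariate fit, coef() returns a matrix with one column of coefficients per response, which makes such comparisons easy to read off; a sketch:

```r
data("iris")
multivariate.fit = lm(cbind(Petal.Length, Petal.Width) ~ Sepal.Length + Sepal.Width + Species,
                      data = iris)

# one column of coefficients per dependent variable
coef(multivariate.fit)
```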
ANOVA
- ANOVA stands for Analysis of Variance.
- It can be used to test whether a continuous variable differs among multiple groups.
- For example, we can test whether all 3 species of plants have the same Sepal.Length.
- ANOVA analysis can be done in R using the lm() function and anova() or aov() functions.
# H0 = The dependent variable is equal in all groups of the independent variable.
# HA = The dependent variable is not equal in all groups of the independent variable.
lm.fit=lm(
data = iris,
formula = Sepal.Length~factor(Species)
)
aov.fit=aov(lm.fit)
summary(aov.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Species) 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Since our ANOVA yielded a significant p-value for the species variable, we know that the species do not all have the same Sepal Length.
- Maybe one species has a different Sepal Length from the others, or maybe all species have different Sepal Lengths.
- If we want to know which species differ from each other, we can use the TukeyHSD() function to perform post-hoc multiple comparison testing.
- This function will perform all pairwise analysis and report an adjusted p-value.
Multiple Comparison Testing
TukeyHSD(aov.fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = lm.fit)
##
## $`factor(Species)`
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862273 1.1737727 0
## virginica-setosa 1.582 1.3382273 1.8257727 0
## virginica-versicolor 0.652 0.4082273 0.8957727 0
- According to our post-hoc analysis, all the species have different sepal length.
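TukeyHSD() is not the only post-hoc option; base R also provides pairwise.t.test(), which runs all pairwise comparisons with a p-value adjustment of your choice. A sketch using Bonferroni adjustment:

```r
data("iris")
# all pairwise comparisons of Sepal.Length across species, Bonferroni-adjusted
pairwise.t.test(iris$Sepal.Length, iris$Species, p.adjust.method = "bonferroni")
```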
ANCOVA
- If we want to compare a single dependent variable (Sepal.Length) to two variables: One categorical variable (Species) and one continuous variable (Petal.Length), we can use an ANCOVA.
- ANCOVA stands for Analysis of Covariance.
- This is very similar to ANOVA, with the addition of a continuous independent variable, termed a “covariate”, which gives ANCOVA the additional “C” in its name.
- Again, we can perform ANCOVA in R combining the lm() function with the aov() function.
ancova.lm.fit=lm(
data = iris,
formula = Sepal.Length~factor(Species)
)
ancova.fit=aov(ancova.lm.fit)
summary(ancova.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Species) 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(ancova.fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = ancova.lm.fit)
##
## $`factor(Species)`
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862273 1.1737727 0
## virginica-setosa 1.582 1.3382273 1.8257727 0
## virginica-versicolor 0.652 0.4082273 0.8957727 0
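Note that ancova.lm.fit above reuses the ANOVA formula, so no covariate is actually included; for a true ANCOVA the continuous covariate must appear in the formula. A sketch using Petal.Length as the covariate (entering it first tests the Species effect after adjusting for the covariate, under R's sequential sums of squares):

```r
data("iris")
# ANCOVA: Sepal.Length by Species, adjusting for the covariate Petal.Length
ancova.cov.fit = aov(Sepal.Length ~ Petal.Length + factor(Species), data = iris)
summary(ancova.cov.fit)
```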
MANOVA
- Now, what happens if we want to compare 2 dependent variables (Sepal.Length, Sepal.Width) across 3 or more different groups (Species)?
- We can use a MANOVA, or Multivariate Analysis of Variance.
manova.lm.fit=lm(
data = iris,
formula = cbind(Sepal.Length, Sepal.Width)~factor(Species)
)
manova.fit=aov(manova.lm.fit)
summary(manova.fit)
## Response Sepal.Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Species) 2 63.212 31.606 119.26 < 2.2e-16 ***
## Residuals 147 38.956 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Sepal.Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Species) 2 11.345 5.6725 49.16 < 2.2e-16 ***
## Residuals 147 16.962 0.1154
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Our MANOVA shows that both dependent variables (Sepal.Length, Sepal.Width) differ across the three groups of the Species variable.
- You can follow up these significant findings with two ANOVAs plus multiple comparison testing to find out which species differ in Sepal.Length and Sepal.Width.
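Note that summary(aov(...)) above reports a separate univariate ANOVA per response; the joint multivariate test comes from the manova() function (Pillai's trace by default). A sketch:

```r
data("iris")
manova.joint.fit = manova(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)
summary(manova.joint.fit)                  # Pillai's trace (the default)
summary(manova.joint.fit, test = "Wilks")  # Wilks' lambda as an alternative
```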
MANCOVA
- Lastly, if we want to compare 2 dependent variables (Sepal.Length, Sepal.Width) in 3 or more different groups (Species) and include one or more covariates (Petal.Length, Petal.Width), we can do so using MANCOVA.
- MANCOVA stands for Multivariate Analysis of Covariance (as you probably expected).
mancova.lm.fit=lm(
data = iris,
formula = cbind(Sepal.Length, Sepal.Width) ~ factor(Species) + Petal.Width + Petal.Length
)
mancova.fit=aov(mancova.lm.fit)
summary(mancova.fit)
## Response Sepal.Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Species) 2 63.212 31.6061 274.73 < 2.2e-16 ***
## Petal.Width 1 5.176 5.1759 44.99 4.125e-10 ***
## Petal.Length 1 17.099 17.0988 148.63 < 2.2e-16 ***
## Residuals 145 16.681 0.1150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Sepal.Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Species) 2 11.3449 5.6725 64.7229 < 2.2e-16 ***
## Petal.Width 1 3.7554 3.7554 42.8496 9.548e-10 ***
## Petal.Length 1 0.4984 0.4984 5.6871 0.01838 *
## Residuals 145 12.7081 0.0876
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Because our MANCOVA yielded significant results, similar to the MANOVA case, you can follow it up with multiple ANCOVAs plus multiple comparison testing to identify which groups of the Species variable differ.
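As in the MANOVA case, the joint multivariate tests for a MANCOVA can be obtained with manova(), with the covariates included in the formula; a sketch:

```r
data("iris")
mancova.joint.fit = manova(
  cbind(Sepal.Length, Sepal.Width) ~ Petal.Length + Petal.Width + Species,
  data = iris
)
summary(mancova.joint.fit)  # multivariate tests for each covariate and for Species
```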
Logistic Regression
- Now, what happens if you want to model a binary outcome (i.e. an outcome with 2 possibilities) using other variables in your data set?
- One of the most common and intuitive models for this would be using Logistic Regression.
- Logistic regression can be easily done in R using the glm() function.
- Let’s look at an example of logistic regression using the Melanoma data set from the MASS package.
- We will try to predict whether a melanoma was ulcerated using all the other variables available.
- We will choose the best model using stepwise variable selection with the stepAIC() function, as before.
# load the melanoma data
data("Melanoma")
# fit a logistic regression with all other variables as predictors
logistic.fit = glm(
data = Melanoma,
formula = ulcer ~ .,
family = 'binomial'
)
# Perform stepwise variable selection
stepwise.logistic.fit = stepAIC(
object = logistic.fit,
direction = 'both'
)
## Start: AIC=237.18
## ulcer ~ time + status + sex + age + year + thickness
##
## Df Deviance AIC
## - age 1 223.23 235.23
## - year 1 223.40 235.40
## - sex 1 223.69 235.69
## <none> 223.19 237.19
## - time 1 226.06 238.06
## - status 1 226.70 238.70
## - thickness 1 250.89 262.89
##
## Step: AIC=235.23
## ulcer ~ time + status + sex + year + thickness
##
## Df Deviance AIC
## - year 1 223.43 233.43
## - sex 1 223.73 233.73
## <none> 223.23 235.23
## - time 1 226.31 236.31
## - status 1 226.70 236.70
## + age 1 223.19 237.19
## - thickness 1 252.03 262.03
##
## Step: AIC=233.43
## ulcer ~ time + status + sex + thickness
##
## Df Deviance AIC
## - sex 1 223.94 231.94
## <none> 223.43 233.43
## - time 1 226.88 234.88
## + year 1 223.23 235.23
## + age 1 223.40 235.40
## - status 1 228.12 236.12
## - thickness 1 254.88 262.88
##
## Step: AIC=231.94
## ulcer ~ time + status + thickness
##
## Df Deviance AIC
## <none> 223.94 231.94
## + sex 1 223.43 233.43
## - time 1 227.62 233.62
## + year 1 223.73 233.73
## + age 1 223.90 233.90
## - status 1 228.68 234.68
## - thickness 1 258.08 264.08
# show the model summary
summary(stepwise.logistic.fit)
##
## Call:
## glm(formula = ulcer ~ time + status + thickness, family = "binomial",
## data = Melanoma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3190 -0.7996 -0.5980 0.9212 1.9530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5179847 0.6504255 0.796 0.4258
## time -0.0003055 0.0001610 -1.897 0.0578 .
## status -0.6787092 0.3145346 -2.158 0.0309 *
## thickness 0.4080844 0.0851858 4.791 1.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 281.13 on 204 degrees of freedom
## Residual deviance: 223.94 on 201 degrees of freedom
## AIC: 231.94
##
## Number of Fisher Scoring iterations: 5
- The best model included the variables time, status, and thickness.
- We can see that “thickness” was the most significant of these.
- An increase of one unit in the thickness variable corresponds to an increase in the log-odds, and therefore the probability, of a melanoma being ulcerated. We can see this because the coefficient for thickness is 0.41, a positive number, so a larger thickness pushes the predicted value of ulcer from 0 (not ulcerated) towards 1 (ulcerated).
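Logistic-regression coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. A sketch, refitting the selected model:

```r
library(MASS)    # provides the Melanoma data set
data("Melanoma")

# refit the model chosen by stepAIC() and convert coefficients to odds ratios
selected.fit = glm(ulcer ~ time + status + thickness,
                   data = Melanoma, family = "binomial")
exp(coef(selected.fit))
```

For thickness, exp(0.408) is roughly 1.5, i.e. each additional unit of tumor thickness is associated with about 1.5 times the odds of ulceration.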
Poisson Regression
- If we want to model an outcome that is a count variable, we can use a Poisson regression.
- A few examples of variables that are a good fit for Poisson regression are:
  + Number of times a patient comes back to the ER per year, given clinical covariates
  + Number of people that pass by the drive-through, depending on the season and food quality
  + Number of people that pay with a credit card, depending on socio-economic covariates
- We will show how to perform Poisson regression using the glm() function.
- We will demonstrate on the warpbreaks data set.
# load warpbreaks data
data("warpbreaks")
# fit poisson regression
poisson.fit = glm(
data = warpbreaks,
family = 'poisson',
formula = breaks ~ wool + tension
)
summary(poisson.fit)
##
## Call:
## glm(formula = breaks ~ wool + tension, family = "poisson", data = warpbreaks)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6871 -1.6503 -0.4269 1.1902 4.2616
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
## woolB -0.20599 0.05157 -3.994 6.49e-05 ***
## tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
## tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 297.37 on 53 degrees of freedom
## Residual deviance: 210.39 on 50 degrees of freedom
## AIC: 493.06
##
## Number of Fisher Scoring iterations: 4
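Poisson coefficients are on the log scale of the expected count; exponentiating them gives multiplicative rate ratios. A sketch:

```r
data("warpbreaks")
poisson.fit = glm(breaks ~ wool + tension, data = warpbreaks, family = "poisson")

# exponentiated coefficients: multiplicative effects on the expected number of breaks
exp(coef(poisson.fit))
```

For example, exp(-0.518) for tensionH is roughly 0.6, i.e. high tension is associated with about 40% fewer expected breaks than low tension, holding wool type fixed.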