Linear Models
Linear models describe the relation between two variables. In non-technical terms, we can think of a linear model like this:
Find a line that is as close as possible to all the points of a scatterplot.
For example, we can use a linear model to depict the relation between the number of sampled animals and the number of different species found in a given area. To create a linear model, we need two vectors (here as two columns of a data.frame) with the matching data pairs (Abundance and Species Number). The lm function then takes a formula with the vector/column names separated by ~. You can interpret the ~ in this case as "as a function of". So lm(Species ~ Abundance) means: give me the number of species as a function of species abundance.
library(dplyr)

bd = read.csv("data/crop_species.csv")

# total abundance and total species number across the three taxa
bd = bd %>% mutate(Abundance = AraInd + CaraInd + IsoMyrInd,
                   Species = AraSpec + CaraSpec + IsoMyrSpec)

lmod = lm(Species ~ Abundance, bd)
lmod
Call:
lm(formula = Species ~ Abundance, data = bd)
Coefficients:
(Intercept) Abundance
10.71651 0.01749
Linear Model Summary
Printing the model returns only the intercept and slope. To get a more useful output, use the summary function. In the following we will go over the output of the summary function for linear models. The explanations are mostly taken from this blog entry: https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
summary(lmod)
Call:
lm(formula = Species ~ Abundance, data = bd)
Residuals:
Min 1Q Median 3Q Max
-6.3136 -1.2481 -0.2235 1.1245 5.3046
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.716507 0.991039 10.813 1.58e-15 ***
Abundance 0.017491 0.003656 4.784 1.22e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.416 on 58 degrees of freedom
Multiple R-squared: 0.2829, Adjusted R-squared: 0.2706
F-statistic: 22.89 on 1 and 58 DF, p-value: 1.221e-05
Call
- the function that was used to create the model
Residuals
- summary statistic of the model error
- “How far away are the real values from the model?”
- should follow a normal distribution (check e.g. by plotting a histogram)
hist(lmod$residuals)
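Besides the visual check with a histogram, you can run a formal normality test; a minimal sketch using shapiro.test() from base R:

# Shapiro-Wilk test: a p-value above 0.05 gives no evidence
# against normally distributed residuals
shapiro.test(lmod$residuals)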
Coefficients
Estimate
- again the slope and intercept
- we can expect to observe on average 10.7 different species on a plot, even if we collect 0 animals (the intercept)
- for each additional animal observed, the species count increases by 0.017
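As a quick sanity check, we can plug an abundance value into the fitted equation by hand (the value 100 is just an arbitrary example):

# predicted species number at an abundance of 100 animals, by hand:
# intercept + slope * abundance
10.716507 + 0.017491 * 100
[1] 12.46561
# the same via the built-in predict function:
predict(lmod, data.frame(Abundance = 100))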
Std. Error
- the expected error of the coefficients
- for each additional animal observed, the species number increases by 0.017 ± 0.0037
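The standard errors also translate into confidence intervals for the coefficients, which base R can compute directly (95% is the default level):

# 95% confidence intervals for the intercept and the slope
confint(lmod)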
t value
- how many standard errors is the estimate away from zero?
- the further away from zero, the better: this reduces the probability that the result is due to chance
- usually not interpreted directly
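The t value can be recomputed from the two columns above (up to rounding):

# t value = Estimate / Std. Error; matches the 4.784 in the summary
0.017491 / 0.003656
[1] 4.78419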
Pr(>|t|)
- the probability of observing a value at least as extreme as t if the coefficient were actually zero
- a small value means that the observed relation is unlikely to be based on chance
- by convention, p-values < 0.05 let us reject the H0 hypothesis (i.e. the model depicts an actual relation)
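This p-value is the two-sided tail probability of the t distribution with 58 degrees of freedom; we can recompute it from the t value (again up to rounding):

# two-sided p-value for the Abundance coefficient,
# close to the 1.22e-05 reported in the summary
2 * pt(-abs(4.784), df = 58)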
Signif. codes
- a legend for the significance level of the p-value
Residual standard error
- the average deviation of the actual values from the values predicted by the model
- degrees of freedom: on how many independent observations is this value based?
actual_values = bd$Species
predicted_values = predict(lmod, data.frame(Abundance = bd$Abundance))
For the specific Abundance values in our observed data, we can calculate the error, i.e. the deviation of the predicted values from the actual values:
# deviation between actual values and predicted values ("error")
residuals = actual_values - predicted_values

# squared residuals: no negative values, large error values are more severe
squared_error = (actual_values - predicted_values)**2
However, we also want to estimate the model's error for abundance values that we have not observed.
# number of observations
n = length(actual_values)
# mean squared error
mean(squared_error)
[1] 5.643017
# mean squared error with degrees of freedom
sum(squared_error)/(n-1)
[1] 5.738662
# root mean squared error
sqrt(mean(squared_error))
[1] 2.375504
# root mean squared error with degrees of freedom == Residuals standard error
sqrt(sum(squared_error)/(n-2))
[1] 2.416113
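R also provides this value directly from the fitted model:

# residual standard error, extracted from the model object
sigma(lmod)
[1] 2.416113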
Multiple R-squared
- How much of the variance is explained by the model?
- Values between 0 and 1
- 0 means no explanation, 1 means perfect fit
# how much variation is there in the observed values
total_sum_of_squares = sum((actual_values - mean(actual_values))**2)

# how much variation is there in the residuals
residual_sum_of_squares = sum((actual_values - predicted_values)**2)

# divide one by the other:
# if both values are similar, this ratio approaches 1
# if the residual sum of squares is very low (i.e. the model is good), this value gets smaller
residual_sum_of_squares / total_sum_of_squares
[1] 0.7170542
# Rsq: values close to 1 mean the residual sum of squares is low compared to the observed variation
1 - residual_sum_of_squares / total_sum_of_squares
[1] 0.2829458
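The same value is stored in the summary object and can be extracted directly:

# R-squared as reported by summary()
summary(lmod)$r.squared
[1] 0.2829458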
F-statistic
- Is there a relation between the predictor and the response?
- Is my model better than a model in which all coefficients are zero?
- The further away from 1, the better!
- Depends on the number of data points.
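For illustration, the F-statistic can be reconstructed from the sums of squares computed above: the variation explained by the model (per model parameter, here 1 predictor) divided by the unexplained variation (per residual degree of freedom):

# explained variation per predictor / unexplained variation per residual df;
# matches the F-statistic of 22.89 in the summary (up to rounding)
((total_sum_of_squares - residual_sum_of_squares) / 1) /
  (residual_sum_of_squares / (n - 2))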
Linear Model Plot
Plotting the linear model gives you 4 different figures that indicate whether the model is statistically valid in the first place, i.e. whether the residuals follow a normal distribution and whether they are homoscedastic.
Residuals vs. Fitted
- Is it actually a linear relation?
- Points randomly scattered around a horizontal line indicate this
Normal Q-Q
- Are the residuals normally distributed?
- Points on the 1-1 line indicate this
- Depicted are the actual residuals vs. a theoretical normal distribution
Scale-Location
- Are the residuals homoscedastic?
- This means that the spread of the residuals is more or less equal over the whole data range
- Equally spread points around the line indicate this
Residuals vs. Leverage
- Are there outliers in the data that have influence on the model?
- In this example, if we left out point #12, the model outcome would change.
plot(lmod)
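By default, R shows the four figures one after another. To arrange them in a single 2 x 2 panel, you can adjust the plotting parameters first:

# show all four diagnostic plots in one window
par(mfrow = c(2, 2))
plot(lmod)
par(mfrow = c(1, 1))  # reset the layout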