Leverages

Notes on leverage

We construct a data set containing an outlier that also has high leverage.

We use simple linear regression given by

\[ Y_i = \beta_0 + \beta_1x_i+\varepsilon_i \]

# Simulate responses from the true line y = b0 + b1*x with Gaussian noise
set.seed(11451)
x <- 0:10
n <- length(x)
b0 <- 5
b1 <- 5
y <- b0 + b1*x + rnorm(n, mean=0, sd=2)
# Plot the simulated data together with the true line
df <- data.frame(x, y)
plot(df)
abline(b0, b1)

We fit a linear regression to these data.

# Model without outliers
model1 <- lm(y~x)
coefs1 <- model1$coefficients
plot(x, y)
abline(coefs1[1], coefs1[2], col = 'red')

Now add an outlier \((30,1)\) to the data. Its \(x\) value lies far from the others, so we expect it to have high leverage.

out <- 1
df <- rbind(df, c(30,out))
plot(df$x,df$y, col = ifelse(df$y==out, "red","blue"), pch=15)

We fit the model again on the augmented data and compute the leverages.

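# Refit on the data that includes the outlier
model2 <- lm(y ~ x, data = df)
coefs2 <- model2$coefficients
plot(df$x,df$y, col = ifelse(df$y==out, "red","blue"), pch=15)
# Blue: fit without the outlier; red: fit with the outlier
abline(coefs1[1], coefs1[2], col = 'blue')
abline(coefs2[1], coefs2[2], col = 'red')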

To compute the leverages, we use the hatvalues() function.

hats <- as.data.frame(hatvalues(model2))
tail(hats)

Indeed, the last observation has a leverage of about \(0.85\), which is well above \(0.5\) and close to the maximum of \(1\). We now interpret this.
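As a sanity check on that number, recall that in simple linear regression the leverage of the \(i^{th}\) point has a closed form that depends only on the \(x\) values. Here \(h_i\) is shorthand introduced just for this check, and \(\bar{x}\) is the sample mean of the \(x_j\):

\[ h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2} \]

With \(x = 0,1,\dots,10,30\) we have \(n=12\), \(\bar{x}=85/12\) and \(\sum_j (x_j-\bar{x})^2 = 8195/12\), so for the added point

\[ h_{12} = \frac{1}{12} + \frac{(30-85/12)^2}{8195/12} \approx 0.083 + 0.769 = 0.852 \]

The response value never enters this formula, so leverage measures how unusual the \(x\) value is, not the \(y\) value.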

Note that the leverage is the diagonal entry of the hat matrix \(P=X(X^TX)^{-1}X^T\). Since \(\hat{Y}=PY\), the influence of the \(k^{th}\) observed value on the \(k^{th}\) fitted value is given by

\[ \frac{\partial \hat{Y}_k}{\partial Y_k} = P_{kk} \]
A small derivative implies the \(k^{th}\) value does not really have a significant impact on the model.
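The derivative can be checked numerically. The following is a minimal sketch, assuming the same \(x\) values and simulation settings as above. The names `X`, `P`, `delta`, `y_shift` and `slope_k` are illustrative and not part of the original notes. It builds \(P\) explicitly, compares its diagonal with hatvalues(), and then shifts the last response to see how far the last fitted value moves.

```r
# Rebuild the data so this block runs on its own (same x values and seed as above)
x <- c(0:10, 30)
set.seed(11451)
y <- c(5 + 5 * x[1:11] + rnorm(11, mean = 0, sd = 2), 1)

# Hat matrix P = X (X^T X)^{-1} X^T, with an intercept column in X
X <- cbind(1, x)
P <- X %*% solve(t(X) %*% X) %*% t(X)

fit <- lm(y ~ x)

# The diagonal of P should agree with hatvalues() up to rounding error
max_diff <- max(abs(diag(P) - hatvalues(fit)))

# Shift the last response by delta, refit, and measure the change in its fitted value
k <- length(y)
delta <- 1
y_shift <- y
y_shift[k] <- y_shift[k] + delta
fit_shift <- lm(y_shift ~ x)
slope_k <- unname(fitted(fit_shift)[k] - fitted(fit)[k]) / delta

print(c(max_diff = max_diff, slope_k = slope_k, P_kk = P[k, k]))
```

Because the fitted values are linear in \(Y\), the finite difference `slope_k` should match `P[k, k]` for any choice of `delta`, not only for small ones.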

This also implies (in a hand-wavy way) that observations with large \(P_{kk}\) have more influence on the coefficients \(\hat{\beta}\), since \(X\hat{\beta}=\hat{Y}\).

Also note that the fitted value \(\hat{Y}_k\) is given by
\[ \hat{Y}_k = \sum_{j=1}^{n} P_{kj} Y_j \]
Since \(P\) is symmetric and idempotent (\(P^2=P\)), comparing the \(k^{th}\) diagonal entries of \(P\) and \(P^2\) gives

\[ P_{kk} = P_{kk}^2 + \sum_{j \neq k} P_{kj}^2 \]
which not only shows \(P_{kk} \geq P_{kk}^2\), hence \(P_{kk} \in [0,1]\), but also illustrates that when \(P_{kk} \to 1\), \(\sum_{j \neq k} P_{kj}^2 \to 0\). The off-diagonal weights then vanish, so \(\hat{Y}_k = \sum_{j=1}^{n} P_{kj} Y_j \to P_{kk}Y_k \approx Y_k\).

This again shows that a high-leverage value \(Y_k\) largely determines its own fitted value, and therefore has a substantial effect on the model.
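The identity above can also be verified directly on the design used in these notes. This is a small sketch under the same assumptions as before (the twelve \(x\) values \(0,\dots,10,30\) and an intercept). The names `idem_err`, `p_kk` and `off_sq` are illustrative only.

```r
# Design matrix for x = 0, ..., 10, 30 with an intercept column
x <- c(0:10, 30)
X <- cbind(1, x)
P <- X %*% solve(crossprod(X)) %*% t(X)   # crossprod(X) computes t(X) %*% X

# Idempotency: P %*% P should reproduce P up to rounding error
idem_err <- max(abs(P %*% P - P))

# Row k of P splits into the diagonal entry and the off-diagonal weights
k <- nrow(X)
p_kk <- P[k, k]
off_sq <- sum(P[k, -k]^2)

# The two sides of P_kk = P_kk^2 + sum_{j != k} P_kj^2
print(c(lhs = p_kk, rhs = p_kk^2 + off_sq, idem_err = idem_err))
```

Rearranging the identity gives `off_sq` equal to `p_kk * (1 - p_kk)`, so the off-diagonal weights in row \(k\) shrink as the leverage approaches \(1\).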