This week, we covered Linear Regression, Logistic Regression, cross-validation, and gradient descent. If you've taken Andrew Ng's Machine Learning course (video lectures here), then a lot of these concepts will be review.

I found myself reading about more regularization techniques (we covered L1 & L2 this week). Regularization is a way to protect your model from overfitting by adding a regularization term to your cost function. This regularization term penalizes some summation of your feature coefficients by a factor of $\lambda$. $\lambda$ is called the shrinkage parameter, because in order to minimize your cost function, you must shrink your feature coefficient values.

When coefficient values shrink, your model is less likely to fit noise, and it becomes more "lean" and "efficient" in a sense, as you are effectively getting rid of features that don't add much on top of what other features already capture due to correlation. In other words, with an increased $\lambda$, both model variance and the effects of multicollinearity are reduced. However, too large a value for $\lambda$ will drive your model toward a horizontal line that spits out $y = \beta_0$ (underfitting), and too small a value approximates ordinary regression without the regularization term, so $\lambda$ should be tuned using cross-validation.
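As a rough sketch of what tuning the shrinkage parameter with cross-validation can look like, the snippet below uses sklearn's `RidgeCV` and `LassoCV` on the first 150 rows of the diabetes dataset. The grid of candidate values and the 5-fold setting are arbitrary choices for illustration, not values from the assignment, and sklearn calls the parameter `alpha` rather than $\lambda$.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import scale

# First 150 rows of the diabetes data, standardized up front for brevity
diabetes = load_diabetes()
X = scale(diabetes.data[:150])
y = diabetes.target[:150]

# Candidate shrinkage values (an arbitrary grid chosen for illustration)
alphas = np.logspace(-2, 2, 30)

# Each estimator scores every candidate alpha with 5-fold CV and keeps the best one
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("Ridge alpha chosen by CV:", ridge_cv.alpha_)
print("Lasso alpha chosen by CV:", lasso_cv.alpha_)
```

The selected values depend on the grid and the folds, so treat them as a starting point rather than a final answer.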

L1 (Lasso) & L2 (Ridge) Regression

In L2 regression (Ridge regression), the regularization term penalizes the sum of the squared coefficient values. I did some googling around to figure out why it's called Ridge, and this is my understanding: in unregularized regression with multicollinearity present, there is a "ridge" (or long valley) in the likelihood function surface. In Ridge regression, where the effect of multicollinearity is reduced, the "ridge" flattens out and the surface takes more of a "bowl" shape. See the surface plots in these posts.

$$\text{Ridge cost function} = \sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta^2_j$$
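One way to put a number on the "ridge" versus "bowl" picture: the curvature of the squared-error surface is governed (up to a constant factor) by $X^T X$, and the ridge penalty adds $\lambda$ to its diagonal. The sketch below uses made-up, nearly collinear data and a hypothetical $\lambda = 1$ to compare condition numbers; a very large condition number corresponds to a long, nearly flat valley.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): x2 is almost a copy of x1
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])

lam = 1.0  # hypothetical shrinkage value

# Curvature of the squared-error surface comes from X^T X;
# the ridge penalty adds lam to its diagonal
hessian_ols = X.T @ X
hessian_ridge = hessian_ols + lam * np.eye(X.shape[1])

# Large condition number: long, nearly flat valley. Smaller: rounder bowl.
cond_ols = np.linalg.cond(hessian_ols)
cond_ridge = np.linalg.cond(hessian_ridge)
print("condition number without penalty:", cond_ols)
print("condition number with ridge penalty:", cond_ridge)
```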

However, the squared penalty might seem a bit harsh for the cost function, so let's look at L1 regression.

In L1 regression (Lasso regression), the regularization term penalizes the sum of the absolute coefficient values. Thus the name: LASSO = Least Absolute Shrinkage and Selection Operator.

$$\text{Lasso cost function} = \sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Previously, I had only learned Ridge. I'm guessing this is what I learned first because implementing the derivative of the cost function with a squared regularization term is much more straightforward than implementing the derivative of the cost function with an absolute value regularization term.
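To make the difference concrete, here is a sketch of the two partial derivatives, assuming the usual linear model $\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$ (the $x_{ij}$ notation is an assumption introduced here, not something defined above). For Ridge, the penalty differentiates cleanly:

$$\frac{\partial}{\partial \beta_j} \left[ \sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{k=1}^{p} \beta^2_k \right] = -2 \sum\limits_{i=1}^{n} x_{ij} (y_i - \hat{y}_i) + 2 \lambda \beta_j$$

For Lasso, the same step only works away from zero:

$$\frac{\partial}{\partial \beta_j} \left[ \sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{k=1}^{p} |\beta_k| \right] = -2 \sum\limits_{i=1}^{n} x_{ij} (y_i - \hat{y}_i) + \lambda \, \mathrm{sign}(\beta_j), \quad \beta_j \neq 0$$

At $\beta_j = 0$ the absolute value has a kink, so the derivative is undefined there. That is why Lasso is usually fit with something other than plain gradient descent; sklearn's Lasso, for example, uses coordinate descent.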

We explored Ridge and Lasso regression using the sklearn diabetes dataset in one of our assignments last week. I'll share some code to illustrate the effect of regularization on feature coefficients.

The first code block sets up our environment.

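import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import scale
from sklearn.linear_model import Ridge, Lasso

# Load the dataset and keep the first 150 samples
diabetes = load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]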


Side note: if you're like me and wondering what the data and target values in the diabetes dataset represent, we're in luck. Last year, someone complained about the lack of documentation, which you can now find here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/descr/diabetes.rst

Next, I wrote a function to iterate through a list of shrinkage parameters and plot the corresponding coefficient values. Yes... my code says alpha, which is inconsistent with the $\lambda$ that I've been throwing around in this post. I blame it on sklearn: their models take in a shrinkage parameter with the keyword argument alpha, so I stuck with their nomenclature.

def plot_coef(linmodel, alphas):
    # Relies on the module-level X, y, and params defined outside this function
    # Standardize the features once, before fitting
    X_data = scale(X)

    # Store parameters corresponding to each alpha
    # (the normalize keyword was removed in scikit-learn 1.2; X_data is already
    # standardized, but the alpha scale differs from older normalize=True runs)
    for i, a in enumerate(alphas):
        fit = linmodel(alpha=a).fit(X_data, y)
        params[i] = fit.coef_

    # Plot: coefficient vs. alpha
    fig = plt.figure(figsize=(14, 8))
    sns.set_palette(sns.color_palette("Paired", len(params.T)))
    for i, param in enumerate(params.T):
        plt.plot(alphas, param, label='x{}'.format(i+1))
    plt.legend(loc='lower right', ncol=5, fontsize=16)
    plt.xlabel('alpha', fontsize=16)
    plt.ylabel('coefficient', fontsize=16)
    plt.title('{} Regression Coefficients'.format(linmodel.__name__), fontsize=24)
    plt.show()


Now to generate some plots!

# L1 (Lasso) Regression
alphas = np.linspace(0.1, 3)
params = np.zeros((len(alphas), X.shape[1]))
plot_coef(Lasso, alphas)

# L2 (Ridge) Regression
alphas = np.logspace(-2, 2)
params = np.zeros((len(alphas), X.shape[1]))
plot_coef(Ridge, alphas)


These plots show that Lasso can drive coefficients to zero, whereas Ridge just makes them very small. So Lasso does double duty! It does variable selection automatically on top of coefficient shrinking. The Lasso plot is also much more interpretable.
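The same point can be checked numerically instead of by eye. The sketch below counts coefficients that are exactly zero for a few shrinkage values; the alphas here are arbitrary illustrative picks, and the features are standardized with `scale` only, so the alpha scale will not necessarily line up with the plots.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import scale

# First 150 rows of the diabetes data, standardized
diabetes = load_diabetes()
X = scale(diabetes.data[:150])
y = diabetes.target[:150]

# Arbitrary shrinkage values chosen for illustration
alphas = [0.1, 1.0, 10.0]

lasso_zeros = []
ridge_zeros = []
for a in alphas:
    lasso = Lasso(alpha=a).fit(X, y)
    ridge = Ridge(alpha=a).fit(X, y)
    # Count coefficients that are exactly zero, not merely small
    lasso_zeros.append(int(np.sum(lasso.coef_ == 0)))
    ridge_zeros.append(int(np.sum(ridge.coef_ == 0)))
    print("alpha={}: Lasso zeros={}, Ridge zeros={}".format(
        a, lasso_zeros[-1], ridge_zeros[-1]))
```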

Another side note: When I first generated these plots using the default matplotlib color palette, the plot colors started repeating after plotting 8 lines. In my quest to find a color palette that wouldn't repeat colors for my 10 feature coefficients, I found that seaborn has a seaborn.xkcd_palette from xkcd's color-naming survey! The xkcd blog post describing the survey results is pretty entertaining. The 954 most common color identifications from the survey are listed here (to be used with seaborn.xkcd_palette). So of course, I played around with it a bit...

Anyways.

In our assignment with the diabetes dataset, we compared Ridge, Lasso, and LinearRegression model errors (Lasso had the lowest error), which was how I was planning to pick between regularization techniques in the future. I did come across a thread: Cross Validated: When should I use lasso vs ridge?

Generally, when you have many small/medium sized effects you should go with ridge. If you have only a few variables with a medium/large effect, go with lasso. Hastie, Tibshirani, Friedman

Overall, it seems like there's no clear answer that one regularization is better than the other, as it is problem dependent. I believe Ridge is easier to implement due to its nice differentiability, but it can't zero out feature coefficients like Lasso.
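Since the choice is problem dependent, comparing cross-validated error is a reasonable way to pick. Here is a minimal sketch of that kind of comparison, assuming 5-fold cross-validation, mean squared error as the metric, and `alpha=1.0` for both regularized models. None of these settings come from the assignment, so the numbers it prints are not the assignment's results.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# alpha=1.0 is a placeholder; in practice it would be tuned with cross-validation
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
}

mean_mse = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so each fold is standardized on its own training split
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
    mean_mse[name] = -scores.mean()
    print("{}: mean CV MSE = {:.1f}".format(name, mean_mse[name]))
```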