L1 (Lasso) and L2 (Ridge) Regularization

Semih Gülüm
5 min read · Mar 14, 2023

What is regularization?

Regularization refers to techniques used to calibrate machine learning models so that they minimize an adjusted (penalized) loss function, which helps prevent overfitting or underfitting.

Why do we need these regularization methods?

Let's continue with an example and assume that our dataset looks like this:

data = [(1, 1), (2, 2), (3, 1), (4, 1), (5, 2)]

Let's deliberately make a bad train/test split so that the loss function is easy to calculate: only the first two points, (1, 1) and (2, 2), go into the training set. If we fit the training data (red dots) with linear regression, a line passing exactly through those two points will be drawn, as shown below.

Kind reminder: squared loss measures the overall amount of error in your model. It is the sum of the squared differences between the model's estimate and the actual value, i.e. loss = Σ (yᵢ − ŷᵢ)². And remember that the main goal is to minimize this loss.
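To make that concrete, here is a minimal pure-Python sketch of the squared loss (the helper name squared_loss is just an illustrative choice):

def squared_loss(y_true, y_pred):
    # sum of squared residuals: sum of (actual - predicted)^2 over all points
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))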

The training loss of this model will be 0, because the loss is computed from the distances between the points and the line drawn by the model, and the predicted line touches both training points. Wow, that's great! Our training loss is zero. But when we calculate the test loss, it obviously explodes. But wait… why is our test loss so high when our training loss is as low as possible? The answer is that the model is overfit: it memorized the training data perfectly, but we could not build a generalizable model. Okay then, how can we decrease the test loss when we are already at the global minimum of the training loss? Well, we should add a penalty! To add that penalty and avoid overfitting, we use L1 or L2 regularization.
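To see the numbers behind this, here is a small sketch that reuses the squared_loss helper above, assuming the train/test split used later in this article (train = (1, 1), (2, 2); test = (3, 1), (4, 1), (5, 2)) and that the overfit model is simply the line y = x drawn through the two training points:

train_x, train_y = [1, 2], [1, 2]
test_x, test_y = [3, 4, 5], [1, 1, 2]

def predict(xs):
    # the memorized line y = x
    return list(xs)

print(squared_loss(train_y, predict(train_x)))  # 0  -> perfect fit on the training points
print(squared_loss(test_y, predict(test_x)))    # 22 -> (3-1)^2 + (4-1)^2 + (5-2)^2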

Now we know why we need L1 and L2 regularization. Regularization provides better long-term predictions by helping decrease the model's test loss.

Adding a Penalty:

In order to avoid overfitting, our predictions must be less sensitive to the training data. But what does "less sensitive" mean? Well, let's recall what the slope means. The slope of a line is a measure of its steepness: the change in y divided by the change in x. So it determines how strongly our features affect the model's output (e.g. a slope of 3 means a one-unit change in x produces a 3-unit change in y).

To make the model less sensitive to the training data, we can add a penalty to the training loss that pushes the slope down. As a result, the model will no longer fit the training data perfectly. Of course, adding a penalty will increase the training loss, but since our main goal is to decrease the test loss, we will still succeed. But how? We use not only the slope but also lambda (λ). Lambda determines how severe the penalty is; in other words, lambda controls how aggressively the model is shrunk.

One point that should not be confused: the logic of regularization is to increase the training loss. So instead of subtracting, we have to add this penalty to the training loss!

L1 (Lasso) Regularization:

New loss function: the sum of the squared residuals + λ·|slope|

L2 (Ridge) Regularization:

New loss function: the sum of the squared residuals + λ·(slope)²

As you can see, these two regularization methods are really similar! The only difference is that L1 penalizes the absolute value of the slope, while L2 penalizes its square.
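Here is a minimal sketch that spells out the two penalized losses side by side, reusing the squared_loss helper from above (the slope and lambda arguments are passed in explicitly and are purely illustrative):

def lasso_loss(y_true, y_pred, slope, lam):
    # L1: sum of squared residuals + lambda * |slope|
    return squared_loss(y_true, y_pred) + lam * abs(slope)

def ridge_loss(y_true, y_pred, slope, lam):
    # L2: sum of squared residuals + lambda * slope^2
    return squared_loss(y_true, y_pred) + lam * slope ** 2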

Now let's continue with our example, calculate the regularized test scores for each method, and compare the results! First of all, we need to define our dataset:

# let's remember our dataset: [(1, 1), (2, 2), (3, 1), (4, 1), (5, 2)]

import numpy as np

# the first two points go to train, the remaining three to test
X_train, X_test, y_train, y_test = np.array([1., 2.]), np.array([3., 4., 5.]), np.array([1., 2.]), np.array([1., 1., 2.])
X_train = X_train.reshape(-1, 1)  # scikit-learn expects 2D feature arrays
X_test = X_test.reshape(-1, 1)

After that we can initialize and train the models on these datasets. Lambda can be any value from 0 to positive infinity. Note that the larger we make lambda, the closer the slope gets (asymptotically) to 0. So as lambda grows, our predictions on the y axis become less and less sensitive to the feature. In scikit-learn, lambda is called alpha; for this example let's set alpha to 0.1.

from sklearn.linear_model import LinearRegression, Lasso, Ridge

# plain linear regression, no penalty
regression = LinearRegression()
regression.fit(X_train, y_train)

# L1 (Lasso) regularization
lasso = Lasso(alpha=0.1, max_iter=100)
lasso.fit(X_train, y_train)

# L2 (Ridge) regularization
ridge = Ridge(alpha=0.1, max_iter=100)
ridge.fit(X_train, y_train)
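As a quick, hedged aside on the claim above that a larger lambda pushes the slope toward zero, this sketch refits Ridge with increasing alpha values on the same training data and prints the learned slope (the alpha grid is purely illustrative):

for alpha in [0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    # coef_[0] is the learned slope; it shrinks toward 0 as alpha grows
    print(alpha, model.coef_[0])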

And get the results as a pandas DataFrame:

error_dict = {"Train" : (regression.score(X_train, y_train), lasso.score(X_train, y_train), ridge.score(X_train,y_train)), 
"Test" : (regression.score(X_test, y_test), lasso.score(X_test, y_test), ridge.score(X_test,y_test))}

df = pd.DataFrame(error_dict, index =['Linear Regression', 'Lasso', 'Ridge'], columns =['Train', 'Test'])
Output of the training runs: train and test scores for each model

The results are exactly what we expected! The linear regression train score is incredible, but the test result is not good. When we apply penalties using Lasso and Ridge regularization, the train score decreases. But the good news is that our test scores improve, so our model is more generalized. In other words, regularization helps to reduce variance. Since our alpha value is lower than 1, Lasso's results look better here. What about the fitted lines? Let's check and interpret the graph together.

As expected, the Lasso and Ridge lines fit the test data better than plain linear regression! One cool note: as we can see in the graph above, Ridge and Lasso regression still produce linear models.
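For readers who want to reproduce a similar comparison plot, here is a hedged matplotlib sketch using the models fitted above (colors, labels, and the x grid are illustrative):

import matplotlib.pyplot as plt

xs = np.linspace(1, 5, 100).reshape(-1, 1)

plt.scatter(X_train, y_train, color="red", label="train data")
plt.scatter(X_test, y_test, color="blue", label="test data")
plt.plot(xs, regression.predict(xs), label="Linear Regression")
plt.plot(xs, lasso.predict(xs), label="Lasso (alpha=0.1)")
plt.plot(xs, ridge.predict(xs), label="Ridge (alpha=0.1)")
plt.legend()
plt.show()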

How can we choose the correct regularization method?

Well, in simple terms, Ridge regression is helpful when we have highly correlated predictors. Lasso regression is useful when we have too many features and want to simplify the model by selecting only the important ones.
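To illustrate that selection behavior, here is a hedged sketch on synthetic data (the data, seed, and alpha are made up purely for illustration): Lasso tends to drive the coefficients of uninformative features exactly to zero, while Ridge only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 5 features, only the first one matters
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # noise-feature coefficients are (typically) exactly 0
print(Ridge(alpha=0.1).fit(X, y).coef_)  # all coefficients stay non-zero, just shrunk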

You can access the notebook from here.

See you in the next article!
