Linear Regression — A Beginner’s Guide

Published in

AlmaBetter

10 min readApr 6, 2021

Understanding and building a custom ML algorithm from scratch in python

Whenever we talk about machine learning models in general, the very first and most simplistic model that comes to our mind is the Linear Regression model. Although its quite simple and easy to understand, it has applications in a wide variety of problem statements. Every data scientist, once in a while, uses this algorithm to solve complex business problems at hand. It is widely popular in the data science community due to its simplicity and effectiveness in real life use cases.

My goal is to make you understand the basics of this algorithm and also make you feel proud of yourself by helping you build the entire algorithm from scratch in python. Lets get started right away !

Linear Regression as the name suggests is used for regression tasks. The goal of this algorithm to quantify the linear dependence of a particular dependent variable on the independent variable. In order to achieve this, the model suggests a straight line fit that minimizes the squared error between the actual dependent variable values and the predicted ones. Lets take an example to understand this more clearly.

Lets say you have a lot of interest in investing in stocks. You have been monitoring the market price of a certain stock for a couple of months now. You observe that for the past few months, the stock shows an increasing trend. And now, based on the data you have, you want to predict what the stock price is going to be after about a month. You also observe that the increase in stock price almost proportional to the number of days. It is here you need to apply linear regression to come up with a linear equation that appropriately describes the change of stock price with number of days. Consequently you will also be able to predict the price after about a month, if the current trend persists.

Now given a particular data set where a target/dependent variable is linearly dependent Y on the independent variable X, lets see how we can come up with the linear red line in the above figure, that approximately predicts the value of Y given a particular X.

Let us take a classic example of uni-variate function Y, which is a function of X, i.e. we have a data set with only one independent variable X and dependent variable Y. Let X take values from 0 to 10. While Y is a function of X plus some noise.

import numpy as np
import matplotlib.pyplot as pltX = np.arange(0,10,.05)
y = X + np.random.normal(loc=0.0, scale=1.5, size=len(X))
X = X[..., np.newaxis]

The linear model that we are going to fit to this data is given by :

A more general form of linear model with k-feature set is given by :

The goal is to come up with appropriate value of model parameters (β0,β1,…,βk) such that variation of predicted Y from actual Y is minimum. The loss function that we use here to calculate the variation of predicted Y from actual Y is given by :

where n is the total number of observations. This loss function differs from the Mean Squared Error (MSE) by a factor of 1/2.

We do not usually perform training on the entire data set. Instead we split the data set into two, namely training and test data set. We train the model using training data set and then check the performance of the model using test data set.

Since we need to minimize the loss function with respect to the model parameters, we need,

where k is the total number of features.

These equations are also known as the normal equations. A model with k independent variables has k+1 such parameters to optimize where β0 is the bias term while β1. β2,…,βk are weights associated with each of the variables. Consequently we get k+1 such normal equations to solve for the k+1 parameters (β0,β1,…,βk). Solving these equations leads us towards finding appropriate values for (β0,β1,…,βk) such that the loss function is minimized. However, this is a quite straightforward analytical approach.

There is an iterative approach to solve this problem as well. Although there are numerous such algorithms that gradually lead us to the global minima of the loss function through iterations, the most commonly used one is the Gradient descent algorithm. We’ll be discussing this approach in detail and eventually use it to build our linear regression model in python.

Gradient descent algorithm

This optimization algorithm gradually converges the values of model parameters (β0,β1,…,βn) towards global minimum of the loss function. The parameters are updated after each training epoch in following manner :

where α is the learning rate of the optimization algorithm.

Usually, a smaller learning parameter α is used in order to avoid overshooting the global minima. However too low values for α might also lead the linear model to get stuck in a local minima. We need to figure out a sweet spot that is perfect for the algorithm to converge to the global minima.

The above discussed algorithm is also known as batch gradient descent algorithm, since the model parameters are updated only after predicting the values for all the observations in the data set. However there is a slightly better optimization algorithm that often gives better results although being computationally expensive sometimes, which is called the stochastic gradient descent. In this method, the model parameters are updated after predicting for the target variables for each of the observations. Stochastic gradient descent is nothing but batch gradient descent with batch size = 1.

Now lets try to implement this in python by creating a Linear Regression class.

class Linear_Regression:  def __init__(self, learn_rate = 0.001, num_epochs=500):
    self.num_epochs = num_epochs
    self.learn_rate = learn_rate  def fit(self, X, y, batch_size=1):
    assert len(X.shape) == 2
    assert X.shape[0] == y.shape[0]    self.sample_size, self.feature_size = X.shape
    self.weights = np.random.random(self.feature_size+1)
    self.batch_per_epoch = self.sample_size//batch_size + 1\
        if self.sample_size//batch_size else self.sample_size//batch_size    for _ in range(self.num_epochs):
      for i in range(self.batch_per_epoch):
        output = self.predict(X[i*batch_size:(i+1)*batch_size,:])
        error = output — y[i*batch_size:(i+1)*batch_size]
        self.weights[1:] -= self.learn_rate*error.dot(X[i*batch_size:(i+1)*batch_size,:])
        self.weights[0] -= self.learn_rate*np.sum(error)
        self.coef_, self.intercept_ = self.weights[1:], self.weights[0]  def predict(self, X):
    return np.dot(self.weights[1:],np.transpose(X)) + self.weights[0]  def fit_predict(self, X, y, batch_size=1):
    self.fit(X, y, batch_size)
    return self.predict(X)

Now we will implement our custom Linear regression algorithm on our uni-variate data where X is the only independent variable and Y is the target variable. Before doing that we need to split our data set in to train and test sets, preferably in 70–30 ratio.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,shuffle=True)

Having split the data, now we would like to train the data using our train data set, and try to plot the regression line along with train data set

lr_model = Linear_Regression()
y_train_pred = lr_model.fit_predict(X_train, y_train)

Now lets see how our model performs on test data set :

y_test_pred = lr_model.predict(X_test)

Lets look at some evaluation metrics to see how our model has performed. Lets look at MSE and r2-score.

MSE (train) =  1.6409363739199418 
R-squared (train) =  0.8153303325365319MSE (test) =  1.918687920414487 
R-squared (test) =  0.8139978314290932

We observe that the model performs equally well on test data set as in the train data set. But unfortunately, in most real life scenarios, in the process of building a model for our data set, we run into over-fitting our train data set due to numerous factors such as high model complexity or having a small data set. The model ends up learning the noise in the data set and hence performs very poorly on a novel test data set. To avoid such issues, we have a method called regularization that helps to generalize our model.

Regularization is a method that involves penalizing our model parameters for trying to over-fitting the training data set. The loss function, which earlier involved just the mean squared error term, is now added by an extra positive factor as a penalty to the model parameters for trying too hard to learn the training features:

where λ is known as the regularization strength and k is the total number of features in data set. λ is the measure of penalty we incur on our model parameters. The role of λ is to shrink the magnitude of weights of certain model parameters such that the dominance of certain features over the output variable decreases by considerable amount. A higher λ penalizes the model parameters severely and tries to generalize the model for unseen data by introducing bias into the system. However a low λ corresponds to a model with high variance and low bias. This type of regularization is also known as L2-regularization. We also have L1 regularization wherein the cost function is given by :

For the time being, lets use L2-Regularization to generalize our data. Given cost function for L2 regularization technique, the update function for the model parameters are given by :

Now, lets implement the following L2-regularization technique in our Linear Regression algorithm.

class Linear_Regression:  def __init__(self, learn_rate = 0.001, num_epochs=500, lamda=0):
    self.num_epochs = num_epochs
    self.learn_rate = learn_rate
    self.lamda = lamda  def fit(self, X, y, batch_size=1):
    assert len(X.shape) == 2
    assert X.shape[0] == y.shape[0]    self.sample_size, self.feature_size = X.shape
    self.weights = np.random.random(self.feature_size+1)
    self.batch_per_epoch = self.sample_size//batch_size + 1\
        if self.sample_size//batch_size else self.sample_size//batch_size    for _ in range(self.num_epochs):
      for i in range(self.batch_per_epoch):
        output = self.predict(X[i*batch_size:(i+1)*batch_size,:])
        error = output — y[i*batch_size:(i+1)*batch_size]
        self.weights[1:] = (1-2*self.learn_rate*self.lamda)*self.weights[1:] - self.learn_rate*error.dot(X[i*batch_size:(i+1)*batch_size,:])
        self.weights[0] = (1-2*self.learn_rate*self.lamda)*self.weights[0] - self.learn_rate*np.sum(error)
        self.coef_, self.intercept_ = self.weights[1:], self.weights[0]  def predict(self, X):
    return np.dot(self.weights[1:],np.transpose(X)) + self.weights[0]  def fit_predict(self, X, y, batch_size=1):
    self.fit(X, y, batch_size)
    return self.predict(X)

Let us try to train our training data using regularized linear regression we just defined :

lr_model = Linear_Regression(lamda=0.1)
y_train_pred = lr_model.fit_predict(X_train, y_train)

y_test_pred = lr_model.predict(X_test)

MSE (train) =  1.6435681949373844 
R-squared (train) =  0.8150341495035764MSE (test) =  1.9183774145855326 
R-squared (test) =  0.8140279326023612

With regularization introduced into the model, we observe that the R2-scores of regularized model on train and test data sets, although negligible, shows some improvement as compared to standard Linear Regression model. Since, the data set deals with a single feature in deciding the value for our target variable, we don’t observe a significant improvement in results by using regularization. However in complex models, where model training often leads to over-fitting, we might actually see a significant improvement in R2-scores after using regularization techniques.

Now let me briefly discuss what are the different built-in modules present in python libraries that are used to train our models using linear regression and regularization techniques.

from sklearn.linear_model import LinearRegression 
from sklearn.linear_model import Lasso, Ridge, ElasticNet

Scikit-learn (sklearn) is a well-known library in python which is widely used by data scientists. It consists of a large number of machine learning algorithms that are used for different purposes. One of them is Linear Regression which is the standard algorithm with no regularization as such. However, Lasso Regression is nothing but linear regression with L1 regularization. Ridge is the linear regression algorithm with L2-regularization. While ElasticNet is a special kind of regularized linear regression algorithm which utilizes a weighted sum of both L1 and L2 regularization.

Before closing off, I would like to highlight upon a couple of factors that we need to keep in mind while using any Linear regression algorithm to model our data. These criterion will ensure that we find the best fit model for our data set :

Linear Assumption : Linear Regression assumes that the relationship between our independent and dependent is linear. It does not support anything else. This may be obvious, but it is good to remember when you have a lot of attributes. You may need to transform data to make the relationship linear.
Check for Residuals and homoscedasticity : The mean of residuals (difference between predicted and actual target variable ) must be close to 0 as much as possible. Also, the variance around the regression line should be same for all values of predictor variable.
Check for Multi-collinearity : There should not be any multi-collinearity in the regression model. Multi-collinearity generally occurs when there are high correlations between two or more independent variables.
Gaussian Distributions : Linear Regression will make more reliable predictions if our independent and dependent variables have a Gaussian distribution.
Re-scaling inputs : Linear Regression often make more reliable predictions if you re-scale input variables using standardization or normalization.

Linear Regression — A Beginner’s Guide

Written by Smruti Ranjan Pradhan