Lejdi Prifti

Machine Learning 01 – Linear Regression

7. January 2024

In this series of articles, we are going to look at an introduction to machine learning. After reading each article in turn, you should have a better grasp of machine learning and be able to experiment with it. In this article, I am going to present linear regression, one of the simplest models in machine learning.


What is machine learning?

IBM defines machine learning as a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. Murphy, in his book Machine Learning, states that the goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest.

The focus on increasingly complex models and their application to practical prediction is what sets machine learning apart from statistics. Statistical inference, by contrast, is the process of using statistics as a tool to draw conclusions from data; it provides statistical guarantees under certain assumptions.

Machine Learning vs Traditional Programming

The image above describes the difference between machine learning and traditional programming well. In traditional programming, you provide the input and the rules (the code) in order to get the answers. For example, you might create a function sumNumbers(int a, int b). The input is the integers a and b, and the rule you wrote is a + b. Machine learning does not work this way: it uses inputs and answers to find the rules. As written above, machine learning is able to detect patterns in data. If we were to apply the previous example to machine learning, we would build a dataset of inputs and answers, feed that dataset to one (or several) of the machine learning algorithms and, finally, get the rules.
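To make this concrete, here is a minimal sketch (illustrative only; it uses NumPy's least-squares solver rather than a full training loop): given nothing but pairs of inputs and their sums, the algorithm recovers weights close to [1, 1], i.e. the rule a + b.

```python
import numpy as np

# Traditional programming: we write the rule (a + b) ourselves.
def sum_numbers(a, b):
    return a + b

# Machine learning: we provide inputs and answers and let an
# algorithm recover the rule from the data.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(50, 2)).astype(float)  # inputs: pairs (a, b)
y = X[:, 0] + X[:, 1]                                # answers: a + b
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(weights)  # close to [1. 1.], i.e. the rule "a + b"
```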

Types of machine learning

There are three types of machine learning problems.

  • Supervised Learning – e.g. classifying photos of hand-written numbers using labelled examples. Is it a 3 or an 8?
  • Unsupervised Learning – e.g. grouping photos of hand-written numbers without labels. A 3 looks different from an 8.
  • Reinforcement Learning – e.g. learning to play chess.

Linear regression

Linear regression is a supervised learning algorithm specifically designed for regression tasks, where the goal is to predict a continuous outcome variable. It models the relationship between the input features and the target variable by fitting a linear equation to the observed data. Unlike classification tasks, where the objective is to predict a categorical outcome, linear regression is focused on estimating the relationship between variables to make continuous predictions.

The equation for building this model is

\hat{y} = f(x) = B_{1}x + B_{0}

and its visual representation is displayed below.

B_{0} is known as the intercept, while B_{1} is called the slope because it determines the angle of the blue line below.

Linear regression from W3Schools

The goal is to fit the model to the data, which means that we must find B_{0} and B_{1} such that f(x_{i}) \approx y_{i}.

Loss function

A loss function (or cost function) is a measure of how well a machine learning model is performing based on its predictions compared to the actual values (ground truth) of the target variable. The goal during the training of a machine learning model is to minimize the value of the loss function. 

In linear regression, the loss function used is Mean Squared Error (MSE). There are multiple reasons why this loss function is used with linear regression. Since the goal of linear regression is to find the line that minimizes the sum of squared differences between the predicted values and the actual values, MSE directly quantifies this objective by penalizing larger errors more heavily. Additionally, minimizing MSE is equivalent to maximizing the likelihood of the observed data under the assumption of normally distributed errors.

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

If we were to break down MSE, we would say that it first calculates the squared difference between the ground truth and the predicted value. It does this for all n points in our dataset and then sums all these squared differences. Finally, it divides the sum by the total number of points, giving the mean. And that is the Mean Squared Error.
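The breakdown above translates directly into code (a minimal sketch with made-up numbers):

```python
import numpy as np

def mse(y_true, y_pred):
    # 1. squared difference per data point
    # 2. sum over all n points
    # 3. divide by n to get the mean
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])
print(mse(y_true, y_pred))  # (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167
```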

A slope-only function

Let’s consider a slope-only function. We have a function that looks like this

\hat{y}= f(x_{i}) = B_{1}x_{i}

For such a function, the MSE would look as follows

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - B_{1}x_{i})^2

If we were to plot it, it would look like the image below.

Our goal is to identify the value of B_{1} that results in the least loss. The lowest point on the curve is the one with the minimum loss, and our model best fits the data when the loss is lowest. How can we find the parameter B_{1} that produces the lowest \text{MSE}? By using its derivative to move towards the minimum.

Loss curve

Gradient descent

We find the minimum of a function (e.g. the MSE) by using an optimization algorithm called gradient descent. The main idea behind gradient descent is to iteratively move towards the minimum of the function by adjusting the parameters in the direction of the steepest decrease of the function.

Here’s a simplified explanation of how gradient descent works:

  1. Initialization:
    • We start with an initial guess for the parameters or weights of the model, in this case B_{1}.
  2. Compute Gradient:
    • We calculate the gradient (derivative) of the function with respect to each parameter at the current point. Our only parameter will be B_{1}. The gradient indicates the direction of the steepest ascent.
  3. Update Parameters:
    • We move in the opposite direction of the gradient to decrease the function value. If the gradient is positive, we move left; if it is negative, we move right. This involves updating the parameters using a small step size called the learning rate.
  4. Repeat:
    • Finally, we must repeat steps 2 and 3 until convergence, a predefined number of iterations, or when the change in the function value becomes very small.
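The four steps above can be sketched in Python for the slope-only model (a minimal sketch; the data, learning rate, and tolerance are made up for illustration):

```python
import numpy as np

def fit_slope(x, y, lr=0.1, n_iters=200, tol=1e-10):
    b1 = 0.0                                     # 1. initialization
    for _ in range(n_iters):
        grad = -2.0 * np.mean((y - b1 * x) * x)  # 2. derivative of MSE w.r.t. B1
        step = lr * grad
        b1 -= step                               # 3. move against the gradient
        if abs(step) < tol:                      # 4. stop once the updates are tiny
            break
    return b1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.5 * x                                      # data generated with slope 2.5
print(fit_slope(x, y))                           # ≈ 2.5
```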

The update rule for the parameter (B_{1}) in each iteration is given by:

B_{1} = B_{1} - \eta \frac{\partial L}{\partial B_{1}}

where \eta is the learning rate.

Gradient descent from Clavoryiant

If the learning rate is too large, the steps will be bigger and we could jump back and forth across the minimum. Consequently, we could miss the global minimum. On the other hand, if the learning rate is too small, the function will take longer to converge: we take tiny steps towards the global minimum, but spend much more time.
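To see the effect of the learning rate, here is a small experiment on the same slope-only setup (the data and learning rates are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.5 * x                        # true slope is 2.5

def run(lr, n_iters=50):
    b1 = 0.0
    for _ in range(n_iters):
        # one gradient descent step on the slope-only MSE
        b1 -= lr * (-2.0 * np.mean((y - b1 * x) * x))
    return b1

print(run(0.2))    # too large: the iterates overshoot and blow up
print(run(0.001))  # too small: still far from 2.5 after 50 steps
print(run(0.05))   # moderate: converges close to the true slope
```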

During each iteration, you can evaluate the loss. If it is lower than a tolerance value you specify, you might decide to stop and accept the current value of B_{1} as the best value.

Non-Convex function

However, we could come across a non-convex function.

A non-convex function is a mathematical function whose graph includes at least one concave (curving downward) region. In simpler terms, a function is non-convex if there exist two points on its graph such that the straight line drawn between them lies below the graph somewhere in between.

Finding the global minimum in such a function would be more difficult, since we could get stuck in a local minimum. Luckily, linear regression is a convex function and we can use the gradient descent to solve the function.

Convex function vs Non-Convex function

Partial derivatives

Going back to our function with two parameters, intercept and slope. When we had a slope-only function, we found the global minimum by constantly updating the slope. Now that we have two parameters, we have to update both of them to find the global minimum. In simple terms, in order to move in the right direction, we must calculate the partial derivative with respect to each of the two parameters. Afterwards, we draw conclusions in the same way we did with the slope-only function.

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - B_{1}x_{i} - B_{0})^2

When we calculate the partial derivative with respect to B_{1}, we keep B_{0} fixed, treating it as a constant.

\frac{\partial}{\partial B_1} \text{MSE} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - (B_1 x_i + B_0))x_i

On the other hand, when we calculate the partial derivative with respect to B_{0}, we keep B_{1} fixed.

\frac{\partial}{\partial B_0} \text{MSE} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - (B_1 x_i + B_0))

The partial derivatives tell us in which direction to go to minimize the loss function. During each iteration, we update both parameters until we find the global minimum.
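Putting the two partial derivatives together, gradient descent with both parameters might look like this (a sketch; the data, learning rate, and iteration count are made up for illustration):

```python
import numpy as np

def fit_line(x, y, lr=0.05, n_iters=5000):
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        residual = y - (b1 * x + b0)
        grad_b1 = -2.0 / n * np.sum(residual * x)  # partial derivative w.r.t. B1
        grad_b0 = -2.0 / n * np.sum(residual)      # partial derivative w.r.t. B0
        b1 -= lr * grad_b1                         # update both parameters
        b0 -= lr * grad_b0                         # during each iteration
    return b0, b1

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x                  # data from intercept 1 and slope 2
print(fit_line(x, y))              # ≈ (1.0, 2.0)
```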

Gradient descent seen from above, from SpringerLink

Linear Regression in scikit-learn

In this section, we are going to look at how we can use linear regression in scikit-learn.

Load the dataset

First off, the diabetes dataset consists of measurements of ten physiological variables (age, sex, body mass index, blood pressure and six blood serum measurements) for 442 patients, together with an indicator of disease progression after one year.

from sklearn import datasets

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
Plot the dataset

Using the diabetes dataset’s first feature, we will create a straightforward scatter plot to illustrate our point. It is important to see what our data looks like.

import matplotlib.pyplot as plt

plt.scatter(x=diabetes_X[:, 0], y=diabetes_y)
Diabetes data plot
Split the dataset

We must divide the data into training and test sets before we can begin training our model. We will use 20% of the data as a test set and the remaining portion as our training set.

from sklearn.model_selection import train_test_split

diabetes_train_X, diabetes_test_X, diabetes_train_y, diabetes_test_y = train_test_split(
    diabetes_X[:, 0].reshape(-1, 1), diabetes_y, test_size=0.2)
Fit the model

It’s time to fit our linear regression model.

from sklearn import linear_model

lr = linear_model.LinearRegression()
lr.fit(diabetes_train_X, diabetes_train_y)
Plot the line

Our model has learnt the best intercept and slope for minimizing the MSE loss. Let’s plot the line. To draw a line we need two points; that is why I picked x to be -0.1 and 0.1.

x = [-0.1, 0.1]
y = [lr.coef_[0] * i + lr.intercept_ for i in x]
plt.scatter(x=diabetes_X[:, 0], y=diabetes_y)
plt.plot(x, y, c='r')
Diabetes data and linear regression line
Calculate the MSE

For me, it came out to approximately 6193.9.

import numpy as np

np.mean((lr.predict(diabetes_test_X) - diabetes_test_y) ** 2)
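The same value can also be obtained with scikit-learn's built-in mean_squared_error. The sketch below repeats the whole pipeline self-contained, with a fixed random_state (an assumption added for reproducibility; the article's own split is unseeded) so the manual and built-in computations can be compared:

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, 0].reshape(-1, 1)                      # first feature only, as above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)       # fixed seed for reproducibility

model = linear_model.LinearRegression().fit(X_train, y_train)
manual = np.mean((model.predict(X_test) - y_test) ** 2)
builtin = mean_squared_error(y_test, model.predict(X_test))
print(manual, builtin)                          # the two values agree
```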

Final thoughts

If you have any questions about linear regression in machine learning, feel free to write them here or contact me.

Consider sharing and donating.

Thank you!

Posted in Technology