Machine Learning 01 – Linear Regression
In this series of articles, we are going to look at an introduction to machine learning. After reading each article in turn, you should have a better grasp of machine learning and be able to experiment with it. In this article, I am going to present linear regression, one of the simplest models in machine learning.
IBM defines machine learning as "a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy". Murphy, in his book Machine Learning, states that the goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest.
What sets machine learning apart from statistics is its focus on increasingly complex models and on practical prediction. Statistical inference, by contrast, is the process of using statistics as a tool to draw conclusions from data, and it offers statistical guarantees under certain assumptions.
The image above describes the difference between machine learning and traditional programming. In traditional programming, you provide the input and the rules (the code) in order to get the answers. For example, you could write a function sumNumbers(int a, int b). The input is the integers a and b, and the rule you wrote is a + b. Machine learning does not work this way: it uses input and answers to find the rules. As written above, machine learning is able to detect patterns in data. If we were to apply the previous example to machine learning, we would build a dataset of inputs and answers, feed that dataset to one (or many) of the machine learning algorithms and, finally, get the rules.
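To make this concrete, here is a minimal sketch (using scikit-learn, which we will also use later in this article) of how a model could recover the a + b rule purely from a dataset of inputs and answers; the dataset and variable names are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Build a dataset of inputs and answers: pairs (a, b) and their sums.
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(50, 2))   # inputs: columns a and b
y = X[:, 0] + X[:, 1]                    # answers: a + b

# Let the model discover the rule from the data.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)     # coefficients ≈ [1, 1], intercept ≈ 0
```

The model never sees the rule a + b, yet it recovers coefficients of roughly 1 for each input and an intercept of roughly 0, which is exactly that rule.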
There are three types of machine learning problems: supervised learning, unsupervised learning and reinforcement learning.
Linear regression is a supervised learning algorithm specifically designed for regression tasks, where the goal is to predict a continuous outcome variable. It models the relationship between the input features and the target variable by fitting a linear equation to the observed data. Unlike classification tasks, where the objective is to predict a categorical outcome, linear regression is focused on estimating the relationship between variables to make continuous predictions.
The equation for building this model is

\hat{y} = f(x) = B_{1}x + B_{0}

and its visual representation is displayed below.
B_{0} is known as the intercept, while B_{1} is called the slope because it determines the angle of the blue line below.
The goal is to fit the model to the data, which means that we must find B_{0} and B_{1} such that f(x_{i}) \approx y_{i} for every data point (x_{i}, y_{i}).
A loss function (or cost function) is a measure of how well a machine learning model is performing based on its predictions compared to the actual values (ground truth) of the target variable. The goal during the training of a machine learning model is to minimize the value of the loss function.
In linear regression, the loss function used is Mean Squared Error (MSE). There are multiple reasons why this loss function is used with linear regression. Since the goal of linear regression is to find the line that minimizes the sum of squared differences between the predicted values and the actual values, MSE directly quantifies this objective by penalizing larger errors more heavily. Additionally, minimizing MSE is equivalent to maximizing the likelihood of the observed data under the assumption of normally distributed errors.
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

If we were to break down MSE, we would say that it first calculates the squared difference between the ground truth and the predicted value. It does this for all n points in our dataset and then calculates the sum of all these squared differences. Finally, it divides the sum by the total number of data points, producing the mean. And that is Mean Squared Error.
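As a sketch, the breakdown above can be written in a few lines of NumPy; the function name mse and the sample values are my own choices for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared differences."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Errors are 1, 0 and -2, so MSE = (1 + 0 + 4) / 3 ≈ 1.667
print(mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```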
Let’s consider a slope-only function:

\hat{y} = f(x_{i}) = B_{1}x_{i}

For such a function, the MSE would look as follows:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - B_{1}x_{i})^2

The image below displays how it looks when plotted.
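As an illustration, assuming some made-up data generated with a true slope of 2, the curve can be traced by evaluating the MSE over a range of candidate slopes:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data generated with a true slope of 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

# Evaluate the MSE for a range of candidate slopes B1.
b1_values = np.linspace(0.0, 4.0, 100)
mse_values = [np.mean((y - b1 * x) ** 2) for b1 in b1_values]

plt.plot(b1_values, mse_values)
plt.xlabel("B1")
plt.ylabel("MSE")
plt.show()  # a parabola whose minimum sits at B1 = 2
```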
Our goal is to identify the value of B_{1} that results in the least amount of loss. The lowest point on the curve is the one that produces the minimum loss, and our model best fits the data when its loss is lowest. How can we find the parameter B_{1} that produces the lowest \text{MSE}? By following the derivative of the loss function downhill until it reaches zero.
This is exactly what the optimization algorithm called gradient descent does. The main idea behind gradient descent is to iteratively move towards the minimum of the function by adjusting the parameters in the direction of the steepest decrease of the function.
Here’s a simplified explanation of how gradient descent works:

1. Initialize the parameter with some starting value.
2. Compute the gradient (derivative) of the loss function at the current value.
3. Update the parameter by taking a small step in the opposite direction of the gradient.
4. Repeat until the loss stops decreasing.
The update rule for the parameter B_{1} in each iteration is given by:

B_{1} = B_{1} - \eta \frac{\partial L(B_{1})}{\partial B_{1}}

where \eta is the learning rate.
If the learning rate is too large, the steps will be bigger and we could jump back and forth from one side of the curve to the other, missing the global minimum. On the other hand, if the learning rate is too small, the algorithm will take longer to converge: we would take tiny steps towards the global minimum, but the time spent would be much higher.
During each iteration, you can evaluate the loss. If it falls below a tolerance value you specify, you might decide to stop and accept the current value of B_{1} as the best value.
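Putting the update rule, the learning rate and the tolerance-based stopping criterion together, a minimal gradient descent sketch for the slope-only model could look like this (the function name, data and default values are my own assumptions for illustration):

```python
import numpy as np

def fit_slope(x, y, lr=0.1, tol=1e-9, max_iter=10_000):
    """Gradient descent for the slope-only model y_hat = B1 * x."""
    B1 = 0.0
    for _ in range(max_iter):
        residuals = y - B1 * x
        grad = -2.0 * np.mean(residuals * x)   # d(MSE)/dB1
        new_B1 = B1 - lr * grad                # update rule with learning rate lr
        if abs(new_B1 - B1) < tol:             # stop once the steps become tiny
            break
        B1 = new_B1
    return B1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.5 * x                  # made-up data with a true slope of 2.5
print(fit_slope(x, y))       # converges to ≈ 2.5
```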
However, we could come across a non-convex function.
A non-convex function is a mathematical function whose graph includes at least one concave (curving downward) region. In simpler terms, a function is non-convex if there exist two points on its graph such that the straight line drawn between them lies below the graph somewhere in between.
Finding the global minimum of such a function is more difficult, since we could get stuck in a local minimum. Luckily, the MSE loss of linear regression is a convex function, so we can safely use gradient descent to minimize it.
Going back to our function with two parameters, the intercept and the slope. With the slope-only function, we found the global minimum by repeatedly updating the slope. Now that we have two parameters, we have to update both of them to find the global minimum. In simple terms, in order to move in the right direction, we must calculate the partial derivative with respect to each of the two parameters. Afterwards, we draw the conclusions in the same way we did for the slope-only function.
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - B_{1}x_{i} - B_{0})^2

When we calculate the partial derivative with respect to B_{1}, we keep B_{0} unchanged:

\frac{\partial}{\partial B_1} \text{MSE} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - (B_{1}x_i + B_{0}))x_i

On the other hand, when we calculate the partial derivative with respect to B_{0}, we keep B_{1} unchanged:

\frac{\partial}{\partial B_0} \text{MSE} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - (B_{1}x_i + B_{0}))

The partial derivatives tell us in which direction to go to minimize the loss function. During each iteration, we update both parameters until we reach the global minimum.
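A sketch of the two-parameter version, updating both parameters in each iteration using the partial derivatives above (the function name, data and hyperparameters are illustrative choices of mine):

```python
import numpy as np

def fit_line(x, y, lr=0.1, n_iter=5_000):
    """Gradient descent for y_hat = B1*x + B0."""
    B0, B1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        residuals = y - (B1 * x + B0)
        grad_B1 = -2.0 / n * np.sum(residuals * x)   # d(MSE)/dB1
        grad_B0 = -2.0 / n * np.sum(residuals)       # d(MSE)/dB0
        B1 -= lr * grad_B1                           # update both parameters
        B0 -= lr * grad_B0                           # in each iteration
    return B0, B1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0            # made-up data: true slope 2, true intercept 1
print(fit_line(x, y))        # ≈ (1.0, 2.0)
```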
In this section, we are going to look at how we can use linear regression in scikit-learn.
First off, the diabetes dataset comprises measurements of ten physiological variables (age, sex, body mass index, blood pressure and six blood serum measurements) for 442 patients, along with an indicator of disease progression after one year.
from sklearn import datasets
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
Using the diabetes dataset’s first feature, we will create a straightforward scatter plot to get a feel for the data.
import matplotlib.pyplot as plt
plt.scatter(x=diabetes_X[:, 0], y=diabetes_y)
plt.show()
We must divide the data into training and test sets before we can begin training our model. We will use 20% of the data as a test set and the remaining portion as our training set.
from sklearn.model_selection import train_test_split
diabetes_train_X, diabetes_test_X, diabetes_train_y, diabetes_test_y = train_test_split(diabetes_X[:, 0].reshape(-1, 1), diabetes_y, test_size=0.2)
It’s time to fit our linear regression model.
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(diabetes_train_X, diabetes_train_y)
Our model learnt the best intercept and slope for minimizing the MSE loss. Let’s plot the fitted line. To draw a line we only need two points, which is why I have picked x to be -0.1 and 0.1.
x = [-0.1,0.1]
y = [lr.coef_[0] * i + lr.intercept_ for i in x]
plt.scatter(x = diabetes_X[:, 0], y = diabetes_y)
plt.plot(x,y, c='r')
Finally, let’s measure the MSE of our model on the test set. For me, it came out to be approximately 6193.9 (your value will differ, since train_test_split shuffles the data randomly).
import numpy as np
np.mean((lr.predict(diabetes_test_X) - diabetes_test_y)**2)
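As a sanity check, the same value can be computed with scikit-learn’s built-in mean_squared_error. This sketch repeats the whole pipeline but seeds train_test_split with random_state, so the split (and therefore the MSE) is reproducible, unlike the unseeded split above:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X[:, 0].reshape(-1, 1), y, test_size=0.2, random_state=0)

lr = LinearRegression().fit(X_train, y_train)

manual = np.mean((lr.predict(X_test) - y_test) ** 2)      # MSE by hand
builtin = mean_squared_error(y_test, lr.predict(X_test))  # MSE via scikit-learn
print(manual, builtin)  # the two values agree
```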
If you have any questions about linear regression in machine learning, feel free to write them in the comments or contact me.
Consider sharing and donating.
Thank you!