The Magic of Machine Learning: Gradient Descent Explained Simply, but With All the Math

Vitomir Jovanović · ITNEXT · Mar 3, 2022

With Gradient Descent Code from Scratch

Introduction

The Gradient Descent algorithm is the essence of many machine learning algorithms, especially neural networks, and of prediction tasks in general. It is used in many AI applications, from face recognition and other computer vision products to all kinds of predictions, whether the target is continuous (i.e. regression, e.g. a price) or categorical (i.e. the probability of some outcome).

The name of the algorithm contains its two parts: the gradient, which is, simply put, the rate of change of one quantity with respect to another, and descent, the movement toward the point in that relationship which minimizes some desired quantity. But what do we want to minimize? The Cost Function. What is the cost function?

The Cost Function

In tasks with a continuous target, the cost function is a squared-error loss (usually called Mean Squared Error), while in tasks with a categorical target the loss function is cross-entropy. We will focus here on the first one.

With a model m, we get some prediction y’ for each row (data point), and we can see how much this prediction differs from the true value y:

J = (∑(y’-y)²)/n

where n is the number of data points (i.e. the size of our sample).
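As a quick sketch of this formula (the numbers here are made up for illustration), the cost can be computed in a few lines of numpy:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])     # true values y (made-up)
y_pred = np.array([2.5, 5.5, 6.0])     # predictions y' from some model
mse = np.mean((y_pred - y_true) ** 2)  # J = (1/n) * sum((y' - y)^2)
print(mse)                             # 0.5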

The Goal of Gradient Descent Is to Find the Optimal Weights

Let’s say we have one feature (predictor), x. Our prediction y’ depends on that feature, and we must find the weights (𝛉₀ and 𝛉) that give the prediction:

y’ = 𝛉₀ + 𝛉·x

This formula is applied to each data point (i.e. if we want to predict student achievement from hours of learning, then to each student, indexed by i). But the optimal 𝛉 will be the same for every data point (student i), because we can rewrite our cost function as:

J(𝛉) = (1/n) ∑ᵢ (y’ᵢ − yᵢ)² = (1/n) ∑ᵢ (𝛉₀ + 𝛉xᵢ − yᵢ)²

The cost function for weight 𝛉.
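To make the student example concrete, here is a minimal sketch of applying the prediction formula to every data point (the hours, intercept, and weight are invented for illustration, not fitted values):

hours = [1.0, 2.0, 3.5]                       # hours of learning per student (made-up)
theta0, theta = 40.0, 10.0                    # illustrative intercept and weight
scores = [theta0 + theta * x for x in hours]  # y' = theta0 + theta * x for each student i
print(scores)                                 # [50.0, 60.0, 75.0]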

Because in this example we have only one feature, the derivative of the total cost function and the partial derivative of the cost function are the same; in general, the total derivative of the cost function is assembled from all the partial derivatives. But what is a derivative? And how will we get to the optimal 𝛉 that gives us the smallest possible error function (in this case MSE)? By the derivative!

Derivative of the Function

The gradient is basically the slope of some function, a measure of how fast the function changes. To get the gradient, we must calculate the derivative of the cost function J with respect to the parameter 𝛉. Basically, we want to find the value of 𝛉 where the error function is smallest.

The slope (gradient) of the function is the ratio between 𝛥J and 𝛥𝛉 (𝛥J/𝛥𝛉). We want the spot where this slope becomes zero, because that is where J reaches its minimum, and we want the error function to be as small as possible. In this simple example it looks like this, but the rule is similar even for more complicated cases.

More generally, the derivative of a linear or nonlinear function can be calculated with the following formula:

df(x)/dx = (f(x + 𝛥x) − f(x)) / 𝛥x, as 𝛥x approaches 0

Calculation of the derivative of f with respect to x.

where df(x)/dx is the notation we will use for the derivative of the function with respect to x. Note that this is exact only as 𝛥x approaches 0. On the other side, f(x + 𝛥x) − f(x) corresponds to 𝛥y for a given x and 𝛥x.
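We can check this definition numerically by plugging in a small 𝛥x. A minimal sketch (the functions and the step size are arbitrary choices):

def numeric_derivative(f, x, dx=1e-6):
    # (f(x + dx) - f(x)) / dx approximates df(x)/dx as dx approaches 0
    return (f(x + dx) - f(x)) / dx

print(numeric_derivative(lambda x: 3 * x + 1, 2.0))  # ~3.0: the slope of the linear function below
print(numeric_derivative(lambda x: x ** 2, 3.0))     # ~6.0: the slope of x**2 at x = 3 is 2x = 6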

The Gradient of Constant, Linear, and Nonlinear Functions

Simple linear function: y = 3x + 1

The derivative of a linear function is always equal to the regression weight, in this case 3. This means that y changes 3 times faster than x, and this value of the gradient is constant: it does not change, because we have a linear function, i.e. a linear, constant change of y depending on x.

dy/dx = ((3(x + 𝛥x) + 1) − (3x + 1)) / 𝛥x = 3𝛥x / 𝛥x = 3

The calculation of the gradient for the linear function above (y = 3x + 1).

For a nonlinear quadratic function, we finally get a gradient (slope) that changes:

d(x²)/dx = ((x + 𝛥x)² − x²) / 𝛥x = 2x + 𝛥x → 2x as 𝛥x approaches 0

The derivative of x². Blue: original function; orange: first derivative; red: the point of zero gradient.

Above you can see that the slope depends on the value of x and changes constantly with it. So we can use the gradient as a sign of which direction to go in order to minimize y (or the cost function). When x is negative, the slope is negative; it is large and negative when x is large and negative, and it increases until it reaches 0 at x = 0. Thus, just by looking at the plot, we can see the relation between x and the slope (gradient). In ML problems, we want to find the value of 𝛉 that minimizes the cost function, i.e. the point where the gradient is zero. Let’s now see what we are really interested in: the derivative of the cost function.
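Here is a minimal sketch of using the gradient as that direction sign: repeatedly stepping against the slope of y = x² (the step size 0.1 is an arbitrary choice) drives x toward the minimum at 0:

x = -4.0                  # start where the slope is negative
for _ in range(100):
    gradient = 2 * x      # derivative of x**2
    x -= 0.1 * gradient   # move against the slope
print(x)                  # ~0.0: the minimum of x**2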

The Derivative of the Cost Function

Let’s differentiate this cost function J(𝛉). We write it dJ(𝛉)/d𝛉, meaning that the differentiation is done with respect to 𝛉:

dJ(𝛉)/d𝛉 = d/d𝛉 [(1/n) ∑ᵢ (y’ᵢ − yᵢ)²]

Derivative of the cost function: first equation.

We can take the constant 1/n out of the derivative because of the multiplication-by-a-constant rule. The sum can also be moved out of the derivative using the sum rule.

dJ(𝛉)/d𝛉 = (1/n) ∑ᵢ d/d𝛉 (y’ᵢ − yᵢ)²

Next step: the constant and the sum moved out of the derivative.

To differentiate this function, we need to treat it as a composition of two functions. So let’s define the following function:

u(𝛉, i) = y’ᵢ − yᵢ, so that J(𝛉) = (1/n) ∑ᵢ u(𝛉, i)²

Transformation for each data point.

Solving further, using the chain rule for each data point, we calculate the derivative of J(𝛉) with respect to u(𝛉, i) and the derivative of u(𝛉, i) with respect to 𝛉:

dJ/du = 2·u(𝛉, i) and du/d𝛉 = xᵢ

The derivatives with respect to each data point (du).

Now it gets a little complicated, but the logic is clear and simple: by the chain rule, the two derivatives above are multiplied together for each data point.

Finally, we get:

dJ(𝛉)/d𝛉 = (2/n) ∑ᵢ (y’ᵢ − yᵢ)·xᵢ

Final derivative of the cost function.

This function dJ(𝛉)/d𝛉 is the real focus of our attention, because it takes a value of the parameter 𝛉 we want to optimize and returns the slope of the tangent of the cost function at that value of 𝛉. The slope tells us the direction to take to minimize the cost.
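As a sanity check, the analytic formula agrees with the numerical slope from the derivative definition. A sketch with made-up data, keeping only the single weight 𝛉 as in the derivation above (no intercept):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # generated as y = 2x, so the optimum is theta = 2

def cost(theta):
    return np.mean((theta * x - y) ** 2)  # J(theta): MSE with a single weight

theta = 0.5
analytic = (2 / len(x)) * np.sum((theta * x - y) * x)  # the final formula above
numeric = (cost(theta + 1e-6) - cost(theta)) / 1e-6    # difference-quotient check
print(analytic, numeric)  # both ~ -14.0: negative slope, so theta must increase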

Programming Gradient Descent from Scratch

Now we will write a simple function that implements all of this for linear regression. It is far simpler than you think!

Let’s first write the contribution of one data point to the derivative of the cost function:

def calculate_error_j_(theta1, x_i, y_i):
    # one data point's contribution to dJ/d𝛉: (y' - y) * x_i
    return ((theta1 * x_i) - y_i) * x_i

It takes some weight, a predictor feature, and a target value. Simple.

OK, but now we must mention a couple of new concepts. The updates of 𝛉 happen in ‘jumps’ whose size is regulated by a parameter called the learning rate. This parameter basically controls how big a step in 𝛉 is taken each time the slope is observed. It has its price: with a higher learning rate the jumps are bigger and the computation faster, but the algorithm can overshoot and miss the optimal spot, the minimum of the cost function, and so end with a less accurate solution. Also, these updates are done through iterations that are called epochs. Some advanced gradient descent algorithms, such as Adam, RMSProp, and others, vary the learning rate, making it smaller when the slope is smaller, thus fine-tuning around the optimal 𝛉 value(s).
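A minimal sketch of this trade-off on the toy function y = 𝛉² (both learning rates are arbitrary choices): a small rate converges toward the minimum at 0, while a too-large rate overshoots further with every jump:

def descend(learning_rate, steps=20, theta=1.0):
    for _ in range(steps):
        theta -= learning_rate * 2 * theta  # gradient of theta**2 is 2*theta
    return theta

print(descend(0.1))  # ~0.01: converging toward the minimum
print(descend(1.1))  # ~38.3: each jump lands further from the minimum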

OK, now we are ready, and we will create a function that calculates and updates the weights in several steps.

def linear_regression(x, y, b=0, b0=0, epochs=1000, learning_rate=0.001):
    N = float(len(y))
    for i in range(epochs):
        y_predicted = b0 + (b * x)                                # prediction with the current weights
        cost = sum([data**2 for data in (y - y_predicted)]) / N   # MSE for this epoch
        b_gradient = -(2/N) * sum(x * (y - y_predicted))          # dJ/db, the formula derived above
        b0_gradient = -(2/N) * sum(y - y_predicted)               # dJ/db0, for the intercept
        b = b - (learning_rate * b_gradient)                      # update the weight
        b0 = b0 - (learning_rate * b0_gradient)                   # update the intercept
        # print(f'b0:{b0}', f'b:{b}')
    return b, b0, cost

This function takes the real x (feature predictor) and y (target) values, plus the number of epochs and the learning rate, and then iterates, updating 𝛉 at the end of each epoch. 𝛉 is denoted as b in the code, while b0 is the intercept of the linear function. First the predicted value (y_predicted) is calculated, then the cost function (MSE), and then the gradient according to the formula above. Then b is updated by this gradient, and the size of this update is regulated by the learning rate. The process repeats for the next epoch, until the given number of epochs is reached; by then we expect the gradient to be close to 0, so b reaches its optimal value. This is the revealed secret of machine learning, although it can be far more complicated in more complex examples.

So, let’s make some fake data and watch this simple algorithm converge to the optimal value of 𝛉. We will see it move for different numbers of iterations (epochs).

Some fake data first:

import numpy as np
import matplotlib.pyplot as plt

predictor = np.random.randn(1000)
target = [2*x + 4 + 0.02*np.random.uniform() for x in predictor]  # fake linear data with small uniform noise
plt.figure(figsize=(4,4))
plt.scatter(predictor, target);

Fake simple data.

Now we will plot the predicted values for different numbers of epochs and see how 𝛉 slowly converges to the optimal value (red line):

plt.figure(figsize=(8,8))
plt.xlabel('x')
plt.ylabel('y')

epochs = {'y': 10, 'orange': 20, 'blue': 30, 'green': 100, 'red': 1000}  # color: number of epochs

plt.scatter(predictor, target)
for color, epoch in epochs.items():
    b, b0, cost = linear_regression(predictor, target, b=0, b0=0, epochs=epoch, learning_rate=0.01)
    predicted = [b0 + b*x for x in predictor]  # fitted line for this number of epochs
    plt.plot(predictor, predicted, color=color, label=f'fitted line - prediction for {epoch} epochs')
plt.legend(loc='best')
Convergence of the weight 𝛉 to the optimal value.

This is a very simple but, in essence, very powerful algorithm that finds the optimal solution in situations where brute-force or other approaches would be impossible because of the very high number of data points and features. So ML can be seen as optimisation. I hope this helps you in your learning.
