Understanding gradient descent

In the world of machine learning, our goal is to minimize the cost of making incorrect predictions and gradient descent is one of the methods that helps us to achieve that goal.

Mathematically, gradient descent is defined as an iterative optimization algorithm that finds the local minimum of a differentiable function. Now, let’s try to break down this definition to understand how gradient descent works.

What is a differentiable function?
A function whose derivative exists at each point in its domain.

What is a derivative?
The slope of the tangent at any point of the function is called the ‘derivative’ of the function at that point. Slope of any line is defined as the ratio of change in y over change in x for any two points on the line.

Hence, slope of the X-axis (a horizontal line) is 0 whereas, slope of the Y-axis (a vertical line) is not defined. Hence, a differentiable function cannot contain a break, corner or a cusp, and should have a non-vertical tangent line at every point in its domain.

Fun fact: Derivative of a multi-variable function with several inputs but a single output is called the ‘gradient.’ It is denoted by ∇f . For gradient, we use the concept of partial derivatives. I recommend watching this video if, like me, visualizing helps you understand concepts better too. 

What does local minimum mean?
The smallest value of a function for a given interval of its entire domain is called its local minimum for that specified interval whereas, the smallest value of the function over its entire domain is called the global minimum of the said function.

What does an optimization algorithm do?
Optimization algorithms are often iterative procedures which start at an initial point x0 and generate a sequence {xk} of iterates that converge to a solution. What is meant by solution differs from algorithm to algorithm.

What is the solution of the optimization algorithm used in gradient descent?
Often, the cost function is neither purely convex or concave. Its more like this

But for pedagogical purposes, we will assume that our cost function looks like this

where A is our starting point, and B is the point at which the value of our function is minimum.

Let f be the differentiable function denoted by J(w,b). Then gradient descent is the unconstrained minimization problem, where we seek a point on the function such that ∇f = 0

How does the algorithm locate the point where f is minimum?
The general idea of this algorithm is to start with a random point (w0 , b0) on our function f and then update it to descend the function till we find its minimum.
To locate the point, we need to specify the direction in which the algorithm should look for it, the step size for every iteration which depends on the learning rate (α), and the update rule for generating new iterates.

  • Direction: We start with point A on our function f (refer Figure 4). Then the algorithm looks in every possible direction and moves along the direction of steepest descent using the directional derivative. The directional derivative in direction u ( a unit vector) is the slope of the function f in direction u, which evaluates to:

Screen Shot 2020-06-13 at 7.30.30 PM

Since, u is a unit vector and cosθ ranges between -1 and 1, this is maximized when cosθ = 1 and minimized when cosθ = -1. In other words, we can find the minimum value of  by moving in the direction of the negative gradient. 

These videos (1 & 2) provide a great explanation about the relationship between directional derivative and gradient.

  • Learning Rate: This hyperparameter controls the rate at which our algorithm learns. If this is too small, it takes the algorithm too many steps to converge whereas, if it is too big, it might just keep bouncing along the ridges of the function and never converge to the minimum.
Figure 5
Credits:https://srdas.github.io/DLBook/GradientDescentTechniques.html

Often, the learning rate is set adaptively such that it is large initially and becomes smaller as it gets closer to the minimum. This process is called annealing the learning rate.

  • Update rule: The rule used to generate the new iterate is

This algorithm will keep updating the iterate till the value of the gradient becomes equal to 0 or is very close to 0.

Figure 6

Conclusion:

The goal of this article was to explain the intuition behind gradient descent. This method is used in a number of machine learning techniques like logistic regression or neural networks. Moreover, there are several types of gradient descent algorithms like batch or stochastic gradient descent and I hope to cover them in my future articles.

%d bloggers like this: