The special ingredient of machine learning.
In the last post, we learned about Linear Regression. We concluded with a cost function that we needed to minimize. Today we will see how to minimize that cost function.
To recap, the cost function was:

J(𝚯0, 𝚯1) = (1/2m) · Σ ( h𝚯(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²

where the sum runs over all the data points, i = 1 to m.
Here, h𝚯(x) is the linear regression equation that we discussed earlier (y = mx + c).
The slope m is represented here as theta, so one can write the equation as:

y = 𝚯1x + c
For multivariate linear regression, it would then become:

y = 𝚯1x1 + 𝚯2x2 + … + 𝚯nxn + c

and the corresponding hypothesis function would then be:

h𝚯(x) = 𝚯1x1 + 𝚯2x2 + … + 𝚯nxn + c

The c is often called the bias (or intercept) term and written as 𝚯0, so we write the above equation as:

h𝚯(x) = 𝚯0x0 + 𝚯1x1 + 𝚯2x2 + … + 𝚯nxn

where x0 is 1.
So the hypothesis function that we are working with, back in the single-variable case, is:

h𝚯(x) = 𝚯0 + 𝚯1x

And the corresponding cost function is:

J(𝚯0, 𝚯1) = (1/2m) · Σ ( h𝚯(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²
The superscript i just denotes the different x’s that we have, that is, the different data points.
Knowing these notations helps in case you want to dive deeper into machine learning, because there is a lot of maths out there. Just understanding what the various notations mean can help a lot at times.
I hope you guys don’t have trouble understanding this summation part:

Σ ( h𝚯(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²
But in case you do, it just means that we are calculating the square of the difference between the output of the hypothesis function for an input x and the value of y that we actually have for that x. I hope that made sense. We calculate that squared difference for every value of x in our dataset and add them all up.
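If seeing it in code helps, here is a minimal sketch of that cost function in Python (I am assuming the data are plain Python lists of numbers; the names hypothesis and cost are just for illustration):

# A minimal sketch of the cost function J(theta0, theta1).
# xs and ys hold the data points; theta0 and theta1 are the parameters.
def hypothesis(theta0, theta1, x):
    # h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def cost(theta0, theta1, xs, ys):
    m = len(xs)  # number of data points
    total = 0.0
    for x, y in zip(xs, ys):
        diff = hypothesis(theta0, theta1, x) - y  # h_theta(x(i)) - y(i)
        total += diff ** 2                        # square the difference
    return total / (2 * m)                        # the conventional 1/(2m) scaling

For example, cost(0, 2, [1, 2, 3], [2, 4, 6]) comes out to 0, because the line y = 2x passes exactly through those points, so there is nothing left to minimize.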
Now comes gradient descent. We need to minimize this squared error. We need to find the values of the thetas for which this cost function is at its minimum.
Look at the cost function closely:

J(𝚯0, 𝚯1) = (1/2m) · Σ ( h𝚯(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²
It’s a function of the thetas: once we feed the x’s and y’s to it, the only variables that remain are the thetas. We can plot the graph of this function and see how the thetas affect it. These thetas are just variables; one could name them a and b, but theta is the convention.
Since we have two independent variables (the thetas) and one dependent one (the output of the cost function), it will be a 3-D plot, like this:
This looks like a valley. Now meditate on this: we are looking for the 𝚯0 and 𝚯1 for which J(𝚯0, 𝚯1) lies in the lowest region of this valley. Try to visualize this by looking at the image above.
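If you want to see this valley for yourself, here is a small sketch that plots J(𝚯0, 𝚯1) over a grid of theta values. It assumes you have numpy and matplotlib installed, and it makes up a tiny dataset purely for the picture:

import numpy as np
import matplotlib.pyplot as plt

# A made-up toy dataset, just so there is something to plot.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.1, 3.9, 6.2, 8.1])

def J(theta0, theta1):
    preds = theta0 + theta1 * xs           # h_theta(x) for every data point
    return np.mean((preds - ys) ** 2) / 2  # squared error with the 1/(2m) scaling

# Evaluate the cost over a grid of (theta0, theta1) values.
t0, t1 = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-4, 8, 100))
surface = np.vectorize(J)(t0, t1)

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(t0, t1, surface, cmap="viridis")
ax.set_xlabel("theta0")
ax.set_ylabel("theta1")
ax.set_zlabel("J(theta0, theta1)")
plt.show()

The exact numbers don’t matter; a squared-error cost like this one always gives a bowl-shaped surface.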
Now comes the interesting part: we differentiate the above equation with respect to the variables, one at a time. Read that again. With respect to the variables, one at a time. This has a special name: the partial derivative. We differentiate the equation with respect to one variable and assume that the other variables are constant.
This helps us see how the function varies with respect to that variable, keeping everything else constant.
So yes, this is what we do here: differentiate the equation with respect to one variable (theta). In this case, if we freeze 𝚯0, i.e. assume that it’s a constant, the function gets reduced to a single-variable one, and differentiating it at a point then gives us the slope of that function at that point. Something like this:
Look at that slice: our entire 3-D graph reduces to that slice, which is a 2-D graph. Why? Because one of the dimensions is lost; we have assumed it is constant. Now, if you have taken any calculus course, everything above might seem like boring baby stuff, but readers who aren’t familiar with calculus might be getting something out of this post. So we have the partial derivative with respect to 𝚯1. What’s next?
We pick a random value of 𝚯1 and see what value we get for the partial derivative. This value is nothing but the slope at that point. This picture will show it better:
Look at that sliced curve: that’s the function we get when we work with just one variable and assume the rest to be constant. Here the second variable (represented by the y-axis) has been taken as constant; see how its value is fixed at 1 for the entirety of the slice.
Look at that tangent: it touches the sliced function at just one point. When we differentiate the sliced function, we get a function for the slope. When we input a value of 𝚯1 into it, it gives us the slope of the tangent at that point.
The value of the slope has two important properties: the sign (negative or positive) and the magnitude. The sign tells us whether the function is going up (increasing) or going down (decreasing) as the theta increases, and the magnitude tells us how fast or how slow.
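To make the sign and magnitude idea concrete, here is a tiny sketch that estimates the slope of a one-variable function numerically. The function f below is just a made-up stand-in for one slice of the cost surface, not the real J:

# A simple one-variable "slice" to play with, with its bottom at theta1 = 3.
def f(theta1):
    return (theta1 - 3.0) ** 2

def slope_at(theta1, eps=1e-6):
    # Finite-difference approximation of the derivative df/dtheta1 at a point.
    return (f(theta1 + eps) - f(theta1 - eps)) / (2 * eps)

print(slope_at(1.0))  # about -4.0: negative, so f decreases as theta1 increases here
print(slope_at(5.0))  # about  4.0: positive, so f increases as theta1 increases here
print(slope_at(3.1))  # about  0.2: tiny magnitude, so we are almost at the bottom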
So, coming back to our 3-D function:
Since we assumed 𝚯0 to be a constant, try to imagine a slice that takes one constant value of 𝚯0. It will look like this again:
Next, take the derivative and look at the slope at any value of 𝚯1. For the above graph, what do you conclude? We see that as the value increases, the function goes downhill and gets steeper and steeper. So what does that tell you? It tells you that if you keep increasing the value of 𝚯1, you will keep going downhill. Great! That’s what we want, isn’t it?
Recall, we want those values of the thetas for which we land at the bottom of the 3-D graph, and going along a path that heads downhill guarantees that!
We do the same for 𝚯0. But why? For the above image, isn’t just working on 𝚯1 enough to get you down to the valley? Well yes, in the above case it is, but not in a case like this:
Go ahead: freeze 𝚯0 to, say, 0.7 in the plot above, visualize the slice, try to traverse down, and see where you end up. Do this again, but this time for 0.3. See the difference?
So we can’t work with one variable all the time and freeze the others; that will take you down for sure, but only to the bottommost part of that slice, not of the whole valley. So what do we do?
We traverse down the valley along both theta directions simultaneously:
Here is the algorithm for updating the thetas:

repeat until convergence: 𝚯j := 𝚯j − α · ∂J(𝚯0, 𝚯1)/∂𝚯j   (for j = 0 and j = 1, updating both simultaneously)
We are updating the thetas by subtracting a term from them. The term

∂J(𝚯0, 𝚯1)/∂𝚯j

is nothing but the partial derivative of the cost function with respect to one variable at a time (see the 𝚯j at the bottom of the fraction). It gives the slope, which has a magnitude and a sign. If it is negative, that means increasing the current value of 𝚯j takes us downhill, which is good, so we want to increase the value of 𝚯j; subtracting a negative term does exactly that, because minus and minus make a plus. Similarly, if the slope is positive, increasing 𝚯j would take us uphill, so reducing the value of 𝚯j makes sense, and subtracting a positive term does exactly that.
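In code, one update step for our two thetas could look like the sketch below. The two partial derivatives written in the comments come from differentiating the squared-error cost above; take this as an illustrative sketch, not the only way to write it:

def gradient_step(theta0, theta1, xs, ys, alpha):
    m = len(xs)
    # Partial derivatives of J(theta0, theta1):
    #   dJ/dtheta0 = (1/m) * sum( h(x(i)) - y(i) )
    #   dJ/dtheta1 = (1/m) * sum( (h(x(i)) - y(i)) * x(i) )
    d_theta0 = sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys)) / m
    d_theta1 = sum(((theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys)) / m
    # Simultaneous update: both new values are computed from the *old* thetas.
    new_theta0 = theta0 - alpha * d_theta0
    new_theta1 = theta1 - alpha * d_theta1
    return new_theta0, new_theta1

Notice that both derivatives are computed from the old values of the thetas before either one is changed; that is what updating them simultaneously means.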
But then, what’s the α term? It’s called the learning rate; it is a constant that decides how big a step we take when we increase or decrease a theta. Increasing its value results in bigger updates, so we might reach the bottommost point faster, but not necessarily.
Look at this:
This is an example of a small learning rate, like 0.001. It results in a small and steady traversal towards the bottommost point. It will take more time, of course, but it will reliably get you to the bottommost point.
Now, what if we have a large learning rate?
If we have a very high learning rate, then when we update our thetas we might *miss* the bottommost point and overshoot uphill, then go back and forth, missing the bottommost point each time, and end up stuck in a loop.
So you get the idea now: each time we update the thetas (the weights), we call it an iteration. We need to keep iterating till we reach the bottommost point.
Once we reach the bottommost point, that is, after a certain number of iterations (theta updates), the value of the cost function stops decreasing and won’t improve any further. That is when we stop.
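Putting everything together, a bare-bones training loop that reuses the cost and gradient_step sketches from above could look like this. The learning rate, the iteration cap, and the stopping tolerance are all made-up illustrative choices:

def train(xs, ys, alpha=0.01, max_iters=10000, tolerance=1e-9):
    theta0, theta1 = 0.0, 0.0  # start from some (here arbitrary) values
    previous_cost = cost(theta0, theta1, xs, ys)
    for iteration in range(max_iters):
        theta0, theta1 = gradient_step(theta0, theta1, xs, ys, alpha)
        current_cost = cost(theta0, theta1, xs, ys)
        # Stop once the cost has (practically) stopped decreasing.
        if previous_cost - current_cost < tolerance:
            break
        previous_cost = current_cost
    return theta0, theta1

On the toy data from earlier, train([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1]) should settle somewhere around 𝚯1 ≈ 2 and 𝚯0 ≈ 0, while cranking alpha up to something like 1.0 on the same data makes the cost blow up instead of shrink, which is exactly the overshooting described above.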
Now you might understand what people mean when you see them talking about “training” their machine learning models and saying that “it takes time”, haha.
Also, you get why it’s called “gradient descent”. The gradient just tells you how steep the slope is at a point on the graph, that is, the magnitude (and sign) of the slope, and descent means we are descending to the bottommost point.
That’s all I had for you on gradient descent. I know it might seem quite overwhelming at first, but try to visualize it and you will get there.
Also, many of you might not have any background in calculus. For those readers, I won’t suggest you take a full calculus course. Instead, I will suggest the YouTube channel 3Blue1Brown, which is simply epic at explaining math. The visualizations this guy presents will simply blow your mind. Do make sure to check it out.
There is more to gradient descent; I can’t cover every aspect in a single post, and I don’t want to overload you with information that isn’t going to help you immediately. I believe that’s a detrimental way of educating. Later, when the need arises, I will keep sharing more on this topic.