The post Gradient Descent appeared first on ML-DL.

We learned about Linear Regression in the last post and concluded with a cost function that we needed to minimize. Today we will see how to minimize this cost function.

To recap, the cost function was:
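The equation image is not shown in this copy; reconstructed in the standard notation the post follows (with m training examples), it reads:

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```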

Here, hθ(x) is the linear regression equation that we discussed earlier (y = mx + c).

Here, m is represented as theta (θ). One can write the equation as

For multivariate linear regression, it would then become

and the corresponding hypothesis function would then be

The c is often called θ0,

so we write the above equation as:

where x0 is 1.
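Written out (again a reconstruction, since the original image is missing), the hypothesis with the θ0 term folded in is:

```latex
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{j=0}^{n} \theta_j x_j, \qquad x_0 = 1
```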

So the hypothesis function that we are working on is:

And the corresponding cost function is:

The superscript *i* just denotes the different x's that we have, that is, the different data points.

Knowing these notations helps in case you want to dive deeper into machine learning, because there is a lot of maths out there. Just understanding what the various notations mean can help a lot at times.

I hope you guys don't have trouble understanding this summation part:

But in case you do, it just means that we are calculating the square of the difference between **the value of the output of the hypothesis function for input x** and **the value of y that we have for x in reality**. I hope that made sense. We calculate this squared difference for all the values of x that we have in our dataset.

Now comes gradient descent. We need to minimize this squared error. We need to find the optimal value of theta for which this cost function is the minimum.

Look at the cost function closely:

It's a function in the theta(s): because we are going to feed the x's and y's to it, the only variables that remain are the theta(s). We can plot the graph of this function and see how the thetas affect it. These thetas are just variables, and one could name them *a* and *b*, but theta is just the convention.

Since we have two independent variables (the thetas) and one dependent one (the output of the cost function), it will be a 3D plot, like this:

This looks like a valley. Now meditate on this: we are looking for the θ0 and θ1 for which J(θ0, θ1) lies in the lowest region of this valley. Try to visualize this by looking at the image above.

Now comes the interesting part: we differentiate the above equation with respect to the variables, one at a time. Read that again. With respect to the variables, one at a time. This has a special name: the partial derivative. We differentiate the equation with respect to one variable and assume that the other variables are constant.

This helps us see how the function varies with respect to that variable, keeping everything else constant.

So yes, this is what we do here: differentiate the equation with respect to one variable (theta). Now in this case, if we freeze θ0, i.e. assume that it's a constant, the function reduces to a single-variable one, and differentiating it at a point gives us the slope of that function. Something like this:

Look at that slice: our entire 3-D graph reduces to that slice, that is, a 2-D graph. Why? Because one of the dimensions is lost; we have assumed it is constant. Now, if you have taken any calculus course, then everything above might seem like boring baby stuff, but any of my readers who aren't familiar with calculus might be getting something from this post. Now, what's next? We calculated the partial derivative for θ1. What's next?

We pick a random value of θ1 and see what value we get for the partial derivative. This value is nothing but the slope at that point. This picture will tell you better:

Look at that sliced curve; that's the function we get when we work with just one variable and assume the rest to be constant. Here the second variable (represented by the y-axis) has been taken as constant: see, its value is fixed to 1 for the entirety of the slice.

Look at that tangent; it touches the sliced function at just one point. When we differentiate the slice function, it gives us a function for the slope. When we input a value of θ1, it gives us the value of the slope of the tangent at that point.

The value of the slope has two important properties: the sign (negative or positive) and the magnitude. The sign tells us whether the function is going up (increasing) or going down (decreasing), and the magnitude tells us how fast or how slow.

So coming back to our 3-D function:

Since we assumed θ0 to be a constant, try to imagine a slice that takes one constant value of θ0. It will look like this again:

Next, take the derivative and see the slope at any value of θ1. For the above graph, what do you conclude? We see that as the value increases, the slope goes downhill and becomes steeper and steeper. So, what does that tell you? It tells you that if you keep increasing the value of θ1, you will keep going downhill. Great, that's what we want, isn't it?

Recall, we want those values of the thetas for which we will land at the bottom of the 3-D graph, and going along a path that heads downhill guarantees that!

We do the same for θ0. But why? For the above image, isn't working on θ1 alone enough to get you down to the valley? Well, yes, in the above case it is, but not in a case like this:

Go ahead: freeze θ0 to, say, 0.7 above, visualize the slice, try to traverse down, and see where you reach. Do this again, but this time for 0.3. See the difference?

So, we can't work with one variable all the time and freeze the others. That will take you down for sure, but it won't be the bottommost part of the surface; it will only be the bottommost part for that slice, of course. So what do we do?

We traverse down the valley along both thetas simultaneously:

Here is the algorithm for updating the thetas:
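The update rule being described is the standard one (reconstructed here, since the original image is not shown): repeat until convergence, updating all the thetas simultaneously:

```latex
\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
          = \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```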

We are updating the thetas by subtracting this term from them:

The term:

is nothing but the partial derivative of the cost function with respect to one variable at a time (see the θj at the bottom). It gives the slope, which has a magnitude and a sign. If it is negative, that means that if we increase the current value of θj we go downhill, which is good, so we need to increase the value of θj; subtracting this term will therefore increase the value of θj, because minus and minus make a plus. Similarly, if the slope is positive, we are going uphill, so reducing the value of θj makes sense.

But then, what's the α term? It's called the learning rate: a constant value that decides by how much we want to increase or decrease the theta values. Increasing its value will result in bigger updates and we might reach the bottommost point faster, but not necessarily.

Look at this:

This is an example of a small learning rate, like 0.001. It will result in a small and steady traversal towards the bottommost point. Of course it will take more time, but it will guarantee that you reach the bottommost point.

Now, what if we have a large learning rate?

If we have a very high learning rate, then when we update our thetas we might *miss* the bottommost point and go uphill, then go back and forth, missing the bottommost point each time, and thus be stuck in a loop.

So you get the idea now: each time we update the thetas, we call it an iteration. We need to keep iterating till we reach the bottommost point.

Once we reach the bottommost point, that is, after a certain number of iterations (theta updates), the value of the cost function stops decreasing and won't improve further.
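The whole procedure can be sketched in a few lines of Python. This is my own minimal illustration, not code from the post: the data, learning rate, and iteration count are all assumptions.

```python
import numpy as np

# Toy data for y = theta0 + theta1 * x, with true theta0 = 1 and theta1 = 2
np.random.seed(0)
x = np.random.rand(100)
y = 2 * x + 1

theta0, theta1 = 0.0, 0.0   # start from an arbitrary point on the valley
alpha = 0.1                 # learning rate
m = len(x)

for _ in range(5000):                  # each pass is one iteration (theta update)
    h = theta0 + theta1 * x            # hypothesis for every data point
    # partial derivatives of the cost with respect to theta0 and theta1
    grad0 = (h - y).mean()
    grad1 = ((h - y) * x).mean()
    # simultaneous update of both thetas
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1
```

After enough iterations, theta0 and theta1 settle near the true values 1 and 2, and the cost stops decreasing.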

Now you might understand what people mean when you see them talking about "training" their machine learning models and saying "it takes time", haha.

Also, you get why it's called "gradient descent": the gradient is just the slope (the collection of partial derivatives) at a point on the graph, and descent means we are descending to the bottommost point.

That's all I had for you on gradient descent. I know this might seem quite overwhelming at first, but try to visualize it and you will get there.

Also, many of you might not have any background in calculus. For those readers, I won't suggest taking a calculus course. Instead, I will suggest the YouTube channel 3Blue1Brown, which is simply epic at explaining math. The visualizations this guy presents will simply blow your mind. Do make sure to check it out.

There is more to gradient descent; I can't cover every aspect in a single post, and I also don't want to overload you with information that is not going to help you immediately. I believe that's a detrimental way of educating. Later, when the need arises, I will give out more info on this topic.


The post Linear Regression in Python: The Naive Way appeared first on ML-DL.

Hola mates, long time no C. Yeah, I am just a Python guy and I don't like C/C++ too much, simply because they aren't that handy. A programming language is a tool that should be easy to use; it should let you implement and do stuff, not waste your time configuring and reading the manuals.

So today we will see how to implement linear regression in Python. If you went through my older posts, you know it's something the stats guys use all the time. A linear regression model is so simple that even calculators have it now. I will present two approaches: the naive one and the bowtied one. The bowtied way is for learning and the naive one is for practical purposes. But before that, we need some data. Now, there are lots of datasets available for this, but then I might have to spend time cleaning and organizing them, which would make this post kind of boring… So, we will use randomly generated data!

I will be showing you the most basic form of linear regression, that is, the function:

Where `x` is a set of points and `y` is the corresponding value associated with that `x`. If you recall, machine learning is about finding a relation between two sets of values, and in this case, this relation is established by this function only by the virtue of the values `m` and `c`. Easy peasy.

So here's what we are gonna do: we will get some random values of `x`, and for the corresponding values of `y` we will just keep 2 times x, i.e. `2*x`. Later we can verify that one of the learned values, i.e. `m`, is also 2. So let's begin!

We import the NumPy library, which stands for Numerical Python; it was developed in 2005 by Travis Oliphant and is used for whole hosts of numerical operations and experiments. It's a mathematician's favourite.

We have used the `random` submodule of NumPy which will help us create an array of any dimension that we want. The `np.random.rand` function takes in a variable number of arguments as dimensions of the output that we want. Think of this as a vector, it needs to have dimensions.

We want 1000 values of `x`. Since we want a series of values of `x`, we will simply pass 1000 as the argument.

Next, we generate `y`, as promised it will be 2 times `x`. So here it is:
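The code snippets did not survive in this copy, so here is a sketch of the generation step as described; the seed is my own addition, for reproducibility.

```python
import numpy as np  # Numerical Python

np.random.seed(0)           # not in the post; added so results are repeatable
x = np.random.rand(1000)    # 1000 random values in [0, 1)
y = 2 * x                   # y is exactly 2*x, so the true m is 2 and c is 0
```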

So, what next? We have our data, the `x` and the `y`. We will split this data into training data (the data we will use to train the model) and testing data (the data we will use to test the model and gauge its accuracy). Since we have 1000 data points, we will use 90% of them for training and keep aside 10% for testing. So let's go!
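The split described above can be done by array slicing; a sketch (regenerating the data so the snippet runs on its own):

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(1000)
y = 2 * x

# first 900 points for training, last 100 for testing (a 90/10 split)
x_train, x_test = x[:900], x[900:]
y_train, y_test = y[:900], y[900:]
```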

So, we have 900 data points both for `x_train` and `y_train` and 100 data points for both `x_test` and `y_test`, the names are self-explanatory. I have done something called `list-slicing` or in this case, `array-slicing` to be precise, what that is: you will have to explore that yourself buddy, not everything is going to be spoon-fed.

So, we have our training data ready, our testing data ready. We need to just create the model. Since this article will focus on the naive a.k.a the easy way, we will use a library to import the pre-built model and then simply train this model with the data. The library that I am going to use is called Scikit-Learn and is a very popular library used for statistical and machine-learning problems.

We import the LinearRegression class and instantiate an object of this class named naive_model, which we will use to fit the data. To train the model, we will simply call the `fit()` method of this class object with `x_train` and `y_train` passed to it. However, before doing that, we need to change the format of our data. The `fit()` method expects `x` to be of the shape (n_samples, n_features), i.e. (number of samples, number of features). Since our data has just one feature, i.e. we only have one `x`, the shape of `x` is (1000,). We need to change this to (1000, 1), which means 1000 samples and 1 feature. So, we do a reshape(-1, 1). The diagram below shows this:

We now fit and train our model.
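Put together, the fitting step looks roughly like this (a self-contained sketch; the variable names mirror the post's descriptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
x = np.random.rand(1000)
y = 2 * x
x_train, y_train = x[:900], y[:900]

naive_model = LinearRegression()
# fit() wants (n_samples, n_features), so reshape (900,) into (900, 1)
naive_model.fit(x_train.reshape(-1, 1), y_train)
```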

That's it; this simple step trains our model. Now let's see how the model predicts the data. We will call the predict() method and compare the output with one of the test data points.

We predict the value for the 45th data point in `x` and the output that we get is 1.64366699. This means the model predicts that the value of `y` for the 45th value of `x` is 1.6436. Notice, I had to call reshape(-1,1) before feeding the value.

Now let's see what the actual value of `y` is for this 45th data point.

Hoo-yah! It's the same! The model predicted this value with 100% accuracy. Now, doing this for all the data points would be tiresome, wouldn't it? So we introduce metrics. Metrics help you assess how good your model is. A very simple metric is the `mean squared error`, known as MSE for short. It simply calculates the difference between the output of the model and the expected output, takes its square, does this for all the points, and then takes the average. So, what would a good MSE score for a model be? Think, anon. Since we want our model to be good, the differences between the predicted and expected values should be close to zero (and ideally zero), and hence the average should also be close to zero (and ideally zero). Let's see the MSE of our model.
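A sketch of the prediction and MSE steps (the exact numbers in the post come from its own random draw, so only the near-zero error carries over here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

np.random.seed(0)
x = np.random.rand(1000)
y = 2 * x
x_train, x_test = x[:900], x[900:]
y_train, y_test = y[:900], y[900:]

naive_model = LinearRegression().fit(x_train.reshape(-1, 1), y_train)

# predict a single test point; slice then reshape it into shape (1, 1) first
pred = naive_model.predict(x_test[45:46].reshape(-1, 1))[0]

# mean squared error over the whole test set
mse = mean_squared_error(y_test, naive_model.predict(x_test.reshape(-1, 1)))
```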

That's a very small number; don't forget the e-31, which means it is scaled by 10^-31. I'd better convert this to an integer.

Hoo-yah, it's 0! So our model is 100% accurate. But you should be worried if this comes out on real-world data, because it is rare to find data with such a simple relation. Real-world data is all complex and mushy-mushy.

Let's see the values of `m` and `c` that we need for the linear equation. We get the `coef_` attribute of the model, where scikit-learn stores the coefficients.

It's 2, as expected, since we got our `y` by 2*x. Also, notice we don't have a `c`; it means it is 0. This suggests that the line passes through (0,0) on the graph. Let's verify this by plotting it.
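Reading the learned parameters back out, as a sketch: scikit-learn stores the slope in `coef_` and the intercept (our `c`) separately in `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
x = np.random.rand(1000)
y = 2 * x

naive_model = LinearRegression().fit(x[:900].reshape(-1, 1), y[:900])

m = naive_model.coef_[0]     # the learned slope, expected to be ~2
c = naive_model.intercept_   # the learned intercept, expected to be ~0
```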

Hoo-yah! matplotlib is a tool that we can use to plot graphs and whatnot. I plotted the (0,0) separately to show that the line crosses the origin(0,0).
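A possible version of that plot (my own sketch, not the post's code; the non-interactive backend is an addition so it runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # draw without a display
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.random.rand(1000)
y = 2 * x

plt.scatter(x, y, s=2, label="data (y = 2x)")
plt.scatter([0], [0], color="red", label="origin (0, 0)")  # plotted separately
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("naive_linear_fit.png")
```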

So that's it guys, that's all for the naive implementation. You will use this for practical purposes. However, to gain an in-depth understanding of the algorithms, we will later see how to implement all this from scratch using NumPy and Python. Till then, stay tuned.


The post Linear Regression appeared first on ML-DL.

In this post, we will look at a very simple machine learning algorithm, which is basically the machine-learning equivalent of a programming language's "hello world".

So what is linear regression? If you are from a statistics background, chances are you already know what it means. In statistics, linear regression is a tool used for finding a relation between a dependent variable and an independent variable. It's that simple. Consider the equation below.

This is a linear equation that captures the relationship between the variables *y* and *x*. The variable *y* and the variable *x* are called the dependent and independent variables, respectively. The subscript *h* means it is our hypothesis function.
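The equation referred to (reconstructed, since the image is not shown here) is:

```latex
y_h = mx + c
```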

If you are given a set of data points and you think that the points are related to each other linearly, that is, there exists some equation *y = mx + c* which is able to represent the relation between the dependent and independent variables, if not exactly then approximately, you would apply the linear regression algorithm and see whether it is a good fit.

In the above picture, there is a set of data points, and we have fit a line through them. Notice that the line does not pass through all the points; in fact, it passes through only a very small number of them. However, it captures the **trend** of the data, which is somewhat linear.

Now how do we find this line? More mathematically, how do we find the parameters *m* and *c* for the linear equation we discussed above?

It's simple. We choose a random line and calculate the sum of the squared distances of the points from this line. We then try to minimize this sum, i.e. we find the parameters for which this sum is lowest. Why does the sum have to be minimum for the best parameters? For this, you must first ask what the best parameters are.

The best parameters are the ones for which 1. we get a line that covers most of the points, and 2. the points it does not cover are as close as possible to the line.

If you think about the above two points, you will realize that both requirements can be satisfied if the sum of the squared distances is minimum.

In particular, we will try to minimize the average of the sum of squared distances. We take the average so that our model doesn't depend on the number of data points.
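As a toy illustration (my own numbers, not from the post), here is that average of squared distances for two candidate lines:

```python
import numpy as np

# Four data points that roughly follow y = 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.1, 3.9, 6.2])

def mean_squared_distance(m, c):
    """Average squared vertical distance of the points from the line y = m*x + c."""
    predictions = m * x + c
    return np.mean((predictions - y) ** 2)

good = mean_squared_distance(2.0, 0.0)   # close fit -> small average
bad = mean_squared_distance(0.5, 1.0)    # poor fit  -> large average
```

The better the line fits the trend, the smaller this average gets; linear regression searches for the m and c that make it smallest.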

There are some geometrical and linear-algebra reasons why we use squared distances instead of absolute distances; I won't cover them in detail here. If you are interested, you can refer to them here.

We write the above-mentioned metric in the form of a function and call it the loss function. See this:
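The loss function described above (a reconstruction in the post's y = mx + c notation, since the image is not shown, averaging over n data points):

```latex
J(m, c) = \frac{1}{n} \sum_{i=1}^{n} \big( (m x_i + c) - y_i \big)^2
```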

Now the goal of linear regression is to reduce the output of this cost function as much as possible, i.e. we want to minimize it.

If everything makes sense here, letâ€™s move to the process of minimizing this function.

Now if you have some experience in maths, you might know that there exists something called an analytical solution, which aims at transforming the given problem into a well-known form and then calculating its solution directly.

But we won't be doing that here, because it leads to some complex matrix-inversion operations, which are computationally expensive in the case of multivariate linear regression, i.e. regression in which we have more than one *x*.

In the next article, we will see how to solve this problem.
