Linear Regression in Python: The Naive Way

A short guide on how to implement linear regression using libraries.

Hola mates, long time no C. Yeah, I am just a Python guy and I don’t like C/C++ too much, simply because they aren’t that handy. A programming language is a tool: it should be easy to use and let you build stuff, not waste your time on configuration and manuals.

So today we will see how to implement linear regression in Python. If you went through my older posts, you know it’s something the stats guys use all the time. A linear regression model is so simple that even calculators have it now. I will present two approaches: the naive one and the bowtied one. The bowtied way is for learning, and the naive one is for practical purposes. But before that, we need some data. Now, there are lots of datasets available for this, but then I might have to spend time cleaning and organizing them, which would make this post kind of boring… So, we will use randomly generated data!

Step 1: Generating Data

I will be showing you the most basic form of linear regression, that is, the function:

`y = m*x + c`

where `x` is a set of points and `y` is the corresponding value associated with each `x`. If you recall, machine learning is about finding a relation between two sets of values, and in this case, that relation is established by this function through just two values: `m` and `c`. Easy peasy.

So here’s what we are gonna do: we will get some random values of `x`, and for the corresponding values of `y` we will just keep 2 times `x`, i.e. `2*x`. Later we can verify that the learned value of `m` is also 2. So let’s begin!

We import the NumPy library, which stands for Numerical Python. It was developed in 2005 by Travis Oliphant and is used for a whole host of numerical operations and experiments. It’s a mathematician’s favorite.

We use the `random` submodule of NumPy, which helps us create an array of any dimensions we want. The `np.random.rand` function takes a variable number of arguments as the dimensions of the output. Think of this as a vector; it needs to have dimensions.

We want 1000 values of `x`. Since we want a simple series of values, we will just pass 1000 as the argument.
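In code, that looks something like this:

```python
import numpy as np

# an array of 1000 random values in [0, 1): our x
x = np.random.rand(1000)
```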

Next, we generate `y`; as promised, it will be 2 times `x`. So here it is:
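```python
# y is exactly 2 times x, so we know the "true" m is 2 and c is 0
y = 2 * x
```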

So, what next? We have our data, the `x` and the `y`. We will split it into training data (the data we use to train the model) and testing data (the data we use to test the model and gauge its accuracy). Since we have 1000 data points, we will use 90% of them for training and keep aside 10% for testing. So let’s go!
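The split itself is just slicing:

```python
# first 900 points for training, last 100 for testing (a 90/10 split)
x_train, y_train = x[:900], y[:900]
x_test, y_test = x[900:], y[900:]
```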

So, we have 900 data points each for `x_train` and `y_train`, and 100 data points each for `x_test` and `y_test`; the names are self-explanatory. I have done something called list-slicing, or in this case array-slicing to be precise. What that is, you will have to explore yourself, buddy; not everything is going to be spoon-fed.

Step 2: Creating a Linear Regression Model

So, we have our training data ready and our testing data ready. We just need to create the model. Since this article focuses on the naive, a.k.a. the easy, way, we will use a library to import a pre-built model and then simply train it with the data. The library I am going to use is called Scikit-Learn, a very popular library for statistical and machine-learning problems.

We import the `LinearRegression` class and instantiate an object of it named `naive_model`, which we will use to fit the data. To train the model, we simply call the `fit()` method of this object with `x_train` and `y_train` passed in. However, before doing that, we need to change the format of our data. The `fit()` method expects `x` to have the shape (n_samples, n_features), i.e. (number of samples, number of features). Since our data has just one feature (we only have one `x`), the shape of `x` is (1000,). We need to change this to (1000, 1), which means 1000 samples with 1 feature each. So, we do a `reshape(-1, 1)`. The sketch below shows this:
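```python
from sklearn.linear_model import LinearRegression

# instantiate the pre-built model
naive_model = LinearRegression()

# (900,) -> (900, 1): same samples, now with an explicit single feature
x_train = x_train.reshape(-1, 1)
```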

Step 3: Training and Testing the Model

We now fit and train our model.
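```python
# learn m and c from the training data
naive_model.fit(x_train, y_train)
```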

That’s it! This simple step trains our model. Now let’s see how the model predicts the data. We will call the `predict()` method and compare the output with one of the test data points.
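Something along these lines (index 45 picks one point from the test set):

```python
# predict() also expects a 2-D (n_samples, n_features) array,
# hence the reshape even for a single value
print(naive_model.predict(x_test[45].reshape(-1, 1)))
```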

We predict the value for the 45th data point in `x_test`, and the output we get is 1.64366699. This means the model predicts that the value of `y` for the 45th value of `x_test` is 1.6436. Notice I had to call `reshape(-1, 1)` before feeding in the value.

Now let’s see what the actual value of `y` is for this 45th data point.
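```python
# the actual value for the same test point
print(y_test[45])
```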

Hoo-yah! It’s the same! The model predicted this value with 100% accuracy. Now, doing this for all the data points would be tiresome, wouldn’t it? So we introduce metrics. Metrics help you assess how good your model is. A very simple metric is the `mean squared error`, known as MSE for short. It calculates the difference between the model’s output and the expected output, squares it, does this for all the points, and then takes the average. So, what would a good MSE score for a model be? Think, anon. Since we want our model to be good, the difference between predicted and expected values should be close to zero (and ideally zero), and hence the average should also be close to zero (and ideally zero). Let’s see the MSE of our model.
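Scikit-Learn ships this metric as `mean_squared_error`; here is a sketch of one way to compute it over the whole test set:

```python
from sklearn.metrics import mean_squared_error

# predict on every test point, then average the squared differences
y_pred = naive_model.predict(x_test.reshape(-1, 1))
mse = mean_squared_error(y_test, y_pred)
print(mse)  # something tiny, on the order of 1e-31
```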

That’s a very small number; don’t forget the e-31, which means it is multiplied by 10^-31. I’d better convert this to an integer.
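```python
# int() truncates the ~1e-31 floating-point dust down to a clean 0
print(int(mse))
```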

Hoo-yah, it’s 0! So our model is 100% accurate. But you should be worried if this comes out on real-world data, because it is rare to find data with such a simple relation. Real-world data is all complex and mushy-mushy.

Let’s see the values of `m` and `c` that we need for the linear equation. We check the `coef_` attribute of the model, where scikit-learn stores the coefficients (and the `intercept_` attribute, where it stores `c`).
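```python
print(naive_model.coef_)       # should print array([2.]), our m
print(naive_model.intercept_)  # should be (practically) 0, our c
```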

It’s 2, as expected, since we got our `y` by `2*x`. Also, notice the intercept `c` comes out as 0. This suggests the line passes through (0,0) on the graph. Let’s verify this by plotting it.
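A plausible version of that plot (the exact styling is my guess):

```python
import matplotlib.pyplot as plt

# the data traces the line y = 2x; plot (0, 0) separately in red
# to highlight that the line passes through the origin
plt.scatter(x, y, s=2)
plt.scatter(0, 0, color="red")
plt.show()
```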

Hoo-yah! Matplotlib is a tool we can use to plot graphs and whatnot. I plotted the (0,0) separately to show that the line crosses the origin (0,0).

Conclusion

So that’s it, guys; that’s all for the naive implementation. This is the version you will use for practical purposes. However, to gain an in-depth understanding of the algorithm, we will see how to implement all of this from scratch using NumPy and plain Python. Till then, stay tuned.
