But What Are Word Embeddings?

The post But What Are Word Embeddings? appeared first on ML-DL.

What is Natural Language Processing? It’s a vast field of science that falls under linguistics, and it existed long before the recent developments in machine learning and deep learning came into the picture. It comprises techniques like Stemming and Lemmatization – processes used to reduce words to their root form – and Named Entity Recognition (NER) – identifying and classifying named entities such as person names, organization names, and locations in text. Traditional NER techniques often rely on handcrafted rules and dictionaries to identify patterns and keywords associated with named entities.

Here at ML-DL, we like to look more into deep learning applications and advancements, that is, language tasks that involve neural networks. Accordingly, we will delve into two types of word embeddings: traditional (non-deep-learning) and modern (deep-learning-based).

Before the rapid rise of deep neural networks and machine learning, natural language tasks were still carried out, using mathematical and statistical tools. Some of these statistical tools are used to build word embeddings too, and do not involve deep learning:

Latent Semantic Analysis (LSA) is one of the most widely used methods, and it is used in conjunction with deep learning methods too. We won’t go deep into the math here, as it is quite a vast topic. The main idea is that we start with a term-document matrix, in other words a vocabulary × document matrix. The vocabulary here is the set of unique words in your task, and a document could be a text, an article, a paragraph, etc., whatever base unit you choose. People mostly choose sentences.

The matrix, say an M×N matrix, records the frequency of each term in each document. A note here: this matrix is mostly sparse, because you won’t expect every word to appear in a given document (assuming the document is a sentence here).

The term-document matrix is then decomposed into three matrices using Singular Value Decomposition(SVD): U, Σ, and V^T. U represents the relationships between words and latent topics, Σ contains the importance of each topic, and V^T represents the relationships between documents and latent topics.

SVD helps us reduce the dimension to a certain number K that is small yet still captures the important features. Each row of the resulting U matrix (truncated to K columns) is the embedding of the corresponding word. This summarises LSA and how it is used to obtain word embeddings.
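As a concrete sketch of the pipeline above, here is LSA on a toy corpus using NumPy’s SVD. The corpus, vocabulary, and K are made up for illustration:

```python
import numpy as np

# Toy corpus: each "document" is a sentence (hypothetical example data).
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix: rows are words, columns are documents.
M = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: keep the top-K singular values.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
K = 2
word_embeddings = U[:, :K] * S[:K]  # each row is a K-dim word embedding

print({w: word_embeddings[i].round(2) for i, w in enumerate(vocab)})
```

On a real corpus you would use a sparse matrix and a truncated solver (e.g. scikit-learn’s `TruncatedSVD`) instead of a dense full SVD.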

GloVe stands for Global Vectors for Word Representation. It builds a co-occurrence matrix that records how often each word in a given corpus appears near every other word. For example, if “cat” and “dog” occur frequently together (meaning nearby) in the given corpus, their co-occurrence count will be high.

After we have calculated the co-occurrence matrix, GloVe does not factorize it with SVD the way LSA factorizes the term-document matrix. Instead, it learns word vectors directly, fitting them with a weighted least-squares objective so that the dot product of two word vectors approximates the logarithm of their co-occurrence count. The fitted vectors are the word embeddings.
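Building the co-occurrence counts can be sketched as follows; the toy corpus and window size are made up for illustration:

```python
from collections import defaultdict

# Hypothetical toy corpus; count co-occurrences within a window of 2 words.
corpus = ["the", "cat", "sat", "on", "the", "mat"]
window = 2

cooc = defaultdict(int)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[(w, corpus[j])] += 1

print(cooc[("the", "cat")])  # how often "cat" appears near "the"
```

GloVe additionally weights pairs by distance within the window and by frequency, but the counting idea is the same.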

There are many other methods, but we will only discuss these two for traditional word-embedding calculation. Now, coming to deep learning-based methods, there are quite a few popular ones:

There are two types of word2vec models. Both operate using neural networks. The first one that we will discuss here is CBOW or Continuous Bag of Words.

The **Continuous Bag of Words (CBOW)** model is one of the two learning architectures provided by the Word2Vec approach to generating word embeddings, developed by researchers at Google. The CBOW model is designed to predict a target word from a set of context words surrounding it.

In **CBOW**, we take a word and its surrounding words. The window size can be any number, from 1 to n. For instance, if the target word is “deep” in the sentence “I am learning deep learning,” and the window size is 2, the context words would be “am” and “learning” (before) and “learning” (after). All of these are one-hot encoded and fed to the model, with the expected output being the target word “deep”.

The CBOW architecture can be broken down into the following components:

**Input Layer**: This layer consists of the context words. Each context word is one-hot encoded with a size equal to the vocabulary.

**Projection Layer**: The one-hot encoded vectors are projected onto a shared hidden layer. Instead of performing a matrix multiplication as typical neural networks do, this projection is simply an averaging of the embeddings of the context words.

**Output Layer**: The output layer is a softmax layer that predicts the target word. The softmax function is used to convert the outputs to probabilities, where the target probability is maximized.
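Here is a minimal numeric sketch of that forward pass, using index lookups in place of explicit one-hot multiplications (the two are equivalent). The tiny vocabulary and random weights are made up, and the model is untrained:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["i", "am", "learning", "deep"]
V, D = len(vocab), 8             # vocabulary size, embedding dimension
W_in = rng.normal(size=(V, D))   # input embedding matrix (lookup table)
W_out = rng.normal(size=(D, V))  # output projection to vocabulary scores

# Context: "am", "learning", "learning"; target: "deep"
context_ids = [1, 2, 2]

# Projection layer: average the context word embeddings.
h = W_in[context_ids].mean(axis=0)

# Output layer: softmax over the vocabulary.
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(vocab[int(probs.argmax())])  # predicted word (untrained, so arbitrary)
```

Training adjusts `W_in` and `W_out` so the softmax assigns high probability to the true target; after training, the rows of `W_in` are the word embeddings.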

Skip-Gram works on the opposite principle of CBOW. Instead of predicting a target word based on the surrounding context words, it tries to predict the context words based on a given word. This becomes useful for less frequent words, as it samples the surrounding words more efficiently. For example, if the target word is “deep” in the sentence “I am learning deep learning techniques,” and the window size is set to 2, Skip-Gram tries to predict the likelihood of “am,” “learning,” “learning,” and “techniques” given the word “deep.”

The architecture of the Skip-Gram model is simple yet effective:

**Input Layer**: The input layer takes a single word in its one-hot encoded form. This word acts as the target word from which the context will be predicted.

**Projection Layer**: Similar to CBOW, the one-hot encoded input vector is used to retrieve a corresponding dense vector from the embedding matrix. This vector represents the target word and serves as the input to the next layer.

**Output Layer**: The output layer is significantly different from CBOW. Instead of a single softmax layer, Skip-Gram has multiple softmax classifiers equal to the number of context words being predicted. For example, for a window size of 2 on each side of the target, there would be 4 softmax outputs. Each softmax predicts the probability distribution over the vocabulary for one context position.
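The (target, context) pairs that Skip-Gram trains on can be generated like this, using the example sentence above with a window of 2:

```python
# Generate (target, context) training pairs for Skip-Gram.
sentence = ["i", "am", "learning", "deep", "learning", "techniques"]
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            pairs.append((target, sentence[j]))

# Pairs whose target is "deep":
print([c for t, c in pairs if t == "deep"])
# → ['am', 'learning', 'learning', 'techniques']
```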

FastText extends the ideas of Word2Vec to consider subword information, such as character n-grams. It was developed by Facebook’s AI Research (FAIR) lab. This allows the model to capture morphological information (e.g., prefixes, suffixes, and the roots of words) and to generate embeddings for words not seen during training.
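The character n-gram idea at the heart of FastText can be sketched as below. FastText represents a word as the sum of the vectors of its n-grams (plus the whole word), so an unseen word like “deepest” still gets an embedding from the n-grams it shares with known words. This helper is a simplified illustration, not the library’s exact implementation:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with < and > as boundary markers."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

print(char_ngrams("deep", 3, 4))
# → ['<de', 'dee', 'eep', 'ep>', '<dee', 'deep', 'eep>']
```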

There are other, more advanced models that specialize in word embeddings. These include BERT and other Transformer-based models, as well as earlier contextual models built on LSTMs. We won’t go into much of their detail in this post.

In conclusion, word embeddings are a vital innovation in natural language processing, enabling machines to grasp and process human language with remarkable depth. These techniques have revolutionized how we handle language in technology. As word embeddings continue to evolve, they promise to enhance a wide range of AI applications, making digital interactions more intuitive and impactful. Whether you’re in tech or just curious about AI, understanding word embeddings is essential for anyone looking to keep up with the latest advancements in machine learning.

What’s more? Learn about Generative Adversarial Networks from scratch!


Generative Adversarial Networks (GANs) From Scratch

The post Generative Adversarial Networks (GANs) From Scratch appeared first on ML-DL.

Well, it’s because GANs work on a fascinating dynamic between two networks: the Generator and the Discriminator. The Generator takes in random noise and tries to generate an output. At first, this output might seem like random noise, but the magic happens as it gets refined through a process called gradient descent learning. Think of it like sculpting—shaping that noise into something meaningful.

Now, here’s where the adversarial aspect kicks in. We need a way to guide the Generator in the right direction. Since we don’t have a specific desired output, we have to think broadly. One clever approach is to classify the output as either real or fake and then provide feedback based on that.

Enter the Discriminator. Its job is to differentiate between real and fake data. We throw both the Generator’s output (the fake sample) and a real sample at it. The Discriminator then does its thing, ideally giving a 1 for real input and a 0 for fake input.

Before we give any feedback to the Generator, there’s another round of training involved. This part can be a bit confusing at first, but bear with me. We train the Discriminator first, teaching it to distinguish between real and fake. Once it’s got that down, it’s the Generator’s turn. Now, the Generator’s goal is to fool the Discriminator—making it classify the fake data as real.

To achieve this, we feed the Generator’s output into the Discriminator, calculate the loss, and then adjust the Generator’s weights accordingly. Crucially, we leave the Discriminator untouched since it’s already been trained.

In essence, the Generator and Discriminator work together as a sort of tag team, but with the Discriminator’s weights frozen during the Generator’s training phase.

The original paper uses the following training algorithm:


Algorithm 1: Minibatch Stochastic Gradient Descent Training of Generative Adversarial Nets

Set the number of steps for updating the discriminator, \( k \) (a hyperparameter, typically set to 1 in experiments).

For each training iteration, iterate over each of the \( k \) steps.

Sample a minibatch of \( m \) noise samples \( \{z^{(1)}, \dots, z^{(m)}\} \) from the noise prior \( p_g(z) \).

Sample a minibatch of \( m \) examples \( \{x^{(1)}, \dots, x^{(m)}\} \) from the data-generating distribution \( p_{\text{data}}(x) \).

Update the discriminator by ascending its stochastic gradient:

\[
\text{Gradient}_d = \frac{1}{m} \sum_{i=1}^m \left[ \log D\big(x^{(i)}\big) + \log\Big(1 - D\big(G(z^{(i)})\big)\Big) \right]
\]

After completing the \( k \) steps for updating the discriminator, sample another minibatch of \( m \) noise samples \( \{z^{(1)}, \dots, z^{(m)}\} \) from the noise prior \( p_g(z) \). Update the generator by descending its stochastic gradient:

\[
\text{Gradient}_g = \frac{1}{m} \sum_{i=1}^m \log\Big(1 - D\big(G(z^{(i)})\big)\Big)
\]

The gradient-based updates can utilize any standard gradient-based learning rule, with momentum being used in the experiments.

Noise prior here means the random noise that we feed to our model, and the data-generating distribution simply means the real data.

We will implement a basic model that works with images, since this is the most popular use case. We will work with anime facial images found here.

```
# imports
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, Dataset
import os
from skimage import io, transform
```

The Generator object takes as input the image shape, which consists of batch size, channels, dimension 1, and dimension 2, along with the latent dimension: the dimension of the noise that will be fed to the model. Why is it called the latent dimension? The term comes from latent space, a compressed representation of an image that captures its important details. Since this lower-dimensional noise input will be upscaled into a new image, we call its size the latent dimension. Not a very rigorous explanation, but it should do for now.

The Discriminator object takes input from the image shape to use it for its model and outputs the binary classification that is whether it’s a real or fake image.

```
class Generator(nn.Module):
    def __init__(self, latent_dim, img_shape):
        super(Generator, self).__init__()
        self.img_shape = img_shape
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(128, 256),
            nn.BatchNorm1d(256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 512),
            nn.BatchNorm1d(512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, img_shape[1] * img_shape[2] * img_shape[3]),  # output size matches one image
            nn.Tanh(),
        )

    def forward(self, z):
        img = self.model(z)
        # Use the actual batch size so the last (smaller) batch also works
        img = img.view(z.size(0), *self.img_shape[1:])
        return img


class Discriminator(nn.Module):
    def __init__(self, img_shape):
        super(Discriminator, self).__init__()
        self.img_shape = img_shape
        self.flatten_dim = img_shape[1] * img_shape[2] * img_shape[3]  # flattened image size
        self.model = nn.Sequential(
            nn.Linear(self.flatten_dim, 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, img):
        # Flatten the input image; output is (batch_size, 1)
        img_flat = img.view(img.size(0), -1)
        validity = self.model(img_flat)
        return validity
```

Basic custom dataset loading that we will use further in our training.

```
class AnimeDataset(Dataset):
    """Anime faces dataset."""

    def __init__(self, root_dir, transform=None):
        """
        Arguments:
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.images = [f for f in os.listdir(root_dir) if 'seed' in f]
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        img_name = os.path.join(self.root_dir, self.images[idx])
        image = io.imread(img_name)
        if self.transform:
            image = self.transform(image)
        return image
```

The training process is simple: for every training iteration, we first train the discriminator by feeding it a fake image and a real image; we sum their losses, compute the gradients, and backpropagate to update the Discriminator.

Once that’s done, we generate a new fake image, pass it through the discriminator, and calculate and backpropagate its loss, but we only update the weights of the generator (recall we don’t touch the discriminator this time). As in the paper’s algorithm, we train the Generator to fool the discriminator.

```
# training
# Set device
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = torch.device("cuda:0")

# Hyperparameters
lr = 0.0002
batch_size = 64
latent_dim = 100
img_shape = (64, 3, 100, 100)
epochs = 20

# Initialize networks
generator = Generator(latent_dim, img_shape).to(device)
discriminator = Discriminator(img_shape).to(device)

# Initialize optimizers and loss function
optimizer_G = optim.Adam(generator.parameters(), lr=lr)
optimizer_D = optim.Adam(discriminator.parameters(), lr=lr)
criterion = nn.BCELoss()

# Prepare data
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((100, 100)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
dataset = AnimeDataset(root_dir='/kaggle/input/gananime-lite/out2/', transform=transform)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Training loop
for epoch in range(epochs):
    for i, real_imgs in enumerate(dataloader):
        try:
            # ---- Train Discriminator ----
            optimizer_D.zero_grad()
            real_imgs = real_imgs.to(device)
            real_imgs = real_imgs.view(real_imgs.size(0), -1)
            batch_size = real_imgs.size(0)
            real_labels = torch.ones(batch_size, 1).to(device)

            # Generate fake images
            z = torch.randn(batch_size, latent_dim).to(device)
            fake_imgs = generator(z)
            fake_labels = torch.zeros(batch_size, 1).to(device)

            # Discriminator loss on real and on fake images
            real_pred = discriminator(real_imgs)
            d_loss_real = criterion(real_pred, real_labels)
            fake_pred = discriminator(fake_imgs.detach())  # detach: don't backprop into G here
            d_loss_fake = criterion(fake_pred, fake_labels)

            # Total discriminator loss
            d_loss = d_loss_real + d_loss_fake
            d_loss.backward()
            optimizer_D.step()

            # ---- Train Generator ----
            optimizer_G.zero_grad()
            z = torch.randn(batch_size, latent_dim).to(device)
            fake_imgs = generator(z)
            fake_pred = discriminator(fake_imgs)
            g_loss = criterion(fake_pred, real_labels)  # G tries to make D say "real"
            g_loss.backward()
            optimizer_G.step()

            if i % 100 == 0:
                print(
                    f"[Epoch {epoch}/{epochs}] [Batch {i}/{len(dataloader)}] "
                    f"[D loss: {d_loss.item()}] [G loss: {g_loss.item()}]"
                )
        except Exception as e:
            print(f"error occurred: {e}")
            continue
```

After 20 epochs of training, here’s what you get:

Not bad, but it could be done better with a different configuration and more training.

So that sums up our article on Generative Adversarial Networks and implementing them from scratch. Stay tuned for more informative articles.


What is Federated Learning

The post What is Federated Learning appeared first on ML-DL.

Federated Learning is like a team of learners working together without revealing their individual knowledge. I am not sure what analogies you can put forward for this, but the one that I can think of is Shamir’s Secret Sharing. In traditional learning, all data needs to be sent to a central server where the model is trained; not so with federated learning.

**Decentralized Learning:** Federated learning is decentralized, much like a blockchain: each device (PC or mobile) collects data and trains the model on the device itself. No raw data is transmitted to the central servers; the local model is a summary of the data your device has seen.

**Collaboration:** What is shared with the central server is the model, not your original data. This ensures your data is not directly accessible to any intruder, unless they hack your device itself.

**Combining Knowledge:** Once all the model updates are received, an aggregation (or some other model-combination algorithm) produces a combined model, which then serves the intended purpose.

**Iterative Process:** This process continues over time as device use continues, and the models improve.
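The “combining knowledge” step is often federated averaging (FedAvg): the server averages the parameters the clients send back. Here is a minimal sketch with a hypothetical one-parameter model; real systems add client sampling, weighting by dataset size, and secure aggregation:

```python
# Minimal FedAvg sketch: each client trains locally on its private data,
# the server averages the returned parameters. Data here is made up.

def local_update(weights, data, lr=0.1):
    """One local training pass for a 1-parameter model y = w * x."""
    w = weights
    for x, y in data:
        grad = 2 * (w * x - y) * x   # gradient of the squared error
        w -= lr * grad
    return w

def fedavg(global_w, client_datasets):
    """Server round: send global weights out, average the returned ones."""
    local_ws = [local_update(global_w, d) for d in client_datasets]
    return sum(local_ws) / len(local_ws)

# Three clients, each holding private data drawn from y = 2x.
clients = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(50):
    w = fedavg(w, clients)
print(round(w, 2))  # → 2.0: the combined model fits all clients' data
```

Note that only `w` ever travels between clients and server; the (x, y) pairs stay on each client.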

**Privacy Protection:** Needless to say, the process is more secure, as your data is never transmitted to remote servers.

**Efficiency:** Rather than training a single model on all the collected data, federated learning trains the models locally, leaving the servers with only the load of aggregating.

**Personalization:** Since the models are trained locally on your device, they become more personalized over time, performing better on an individual basis.

**Smartphones:** It’s been used for a long time in your keyboard, which suggests what you are going to type next (scary).

**Healthcare:** It has applications in the medical sector, where patient data can be used without compromising patient identity.

**Recommendation Systems:** Streaming services like Netflix and Prime already use this to cater the best content to you.

Overall, federated learning is a very promising technology that solves a very crucial problem. Solving privacy while also making the whole process more efficient is like hitting two birds with one stone. With the growth in the use of personal computing devices, from watches to televisions, more advances are being made in this area to make it more secure and reliable.


The post Gradient Descent appeared first on ML-DL.

We learned in the last post about Linear Regression. We concluded with a cost function that we needed to minimize. Today we will see how we minimize this cost function.

To recap, the cost function was:

\[
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right)^2
\]

Here, \( h_\theta(x) \) is the linear regression equation that we discussed earlier (\( y = mx + c \)).

Here \( m \) is represented as theta. One can write the equation as \( h_\theta(x) = \theta_1 x + c \).

For multivariate linear regression, with inputs \( x_1, x_2, \dots, x_n \), the corresponding hypothesis function would then be

\[
h_\theta(x) = \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + c .
\]

The \( c \) is often called \( \theta_0 \), so we write the above equation as

\[
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n ,
\]

where \( x_0 \) is 1.

So the hypothesis function that we are working on is \( h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j \),

and the corresponding cost function is:

\[
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right)^2
\]

The superscript *i* just denotes the different x’s that we have, that is, the different data points.

Knowing these notations just helps you in case you want to further dive deep into machine learning because there is a lot of maths out there. Just understanding what the various notations mean can help a lot at times.

I hope you guys don’t find trouble understanding this summation part:

But in case you do, it just means that we are calculating the square of the difference between **the value of the output of the hypothesis function for input x** and **the value of y that we have for x in reality**. I hope that made sense. We calculate this squared difference for all the values of x that we have in our dataset.

Now comes gradient descent. We need to minimize this squared error. We need to find the optimal value of theta for which this cost function is the minimum.

Look at the cost function \( J(\theta_0, \theta_1) \) closely.

It’s a function in the theta(s): once we feed the x’s and y’s to it, the only variables that remain are the thetas. We can plot the graph of this function and see how the thetas affect it. These thetas are just variables; one could name them *a* and *b*, but theta is just the convention.

Since we have two independent variables (the thetas) and one dependent one (the output of the cost function), it will be a 3-D plot, like this:

This looks like a valley. Now meditate on this, we are looking for that 𝚯0 and 𝚯1 for which J(𝚯0,𝚯1) lies in the lowest region of this valley. Try to visualize this by looking at the image above.

Now comes the interesting part: we differentiate the above equation with respect to the variables, one at a time. Read that again: with respect to the variables, one at a time. This is called a partial derivative. We differentiate the equation with respect to one variable and assume that the other variables are constant.

This helps us see how the function varies with respect to that variable, keeping everything else constant.

So yes, this is what we do here. Differentiate the equation with respect to one variable( theta). Now in this case, if we freeze 𝚯0, i.e assume that it’s a constant, the function would get reduced to a single variable one, and then differentiating it at a point would give us the slope of that function. Something like this:

Look at that slice: our entire 3-D graph reduces to that slice, a 2-D graph. Why? Because one of the dimensions is lost; we have assumed it is constant. Now, if you have taken any calculus course, everything above might seem too boring and baby stuff, but any of my readers who aren’t familiar with calculus might be getting something from this post. So, what’s next? We calculated the partial derivative for 𝚯1, what’s next?

We pick a random value of 𝚯1 and see what value we get for the partial derivative. This value is nothing but the slope at that point. This picture will tell you better:

Look at that sliced curve, that’s the function we get when we work with just one variable and assume the rest to be constant. Here the second variable (represented by the y-axis) has been taken as constant, see its value is fixed to 1 for the entirety of the slice.

Look at that tangent, it touches the sliced function at just one point, when we differentiate the slice function it gives us the function for slope. When we input a value of 𝚯1, it gives us the value of the slope of the tangent at that point.

The value of the slope has two important properties, the sign ( negative or positive) and the magnitude. The sign tells us whether the function is going up ( increasing) or going down (decreasing), and the magnitude tells us how fast or how slow.

So coming back to our 3-D function:

Since we assumed 𝚯0 to be a constant, try to imagine a slice that takes one constant value of 𝚯0. It will look like this again :

Next, take the derivative and see the slope at any value of 𝚯1. For the above graph, what do you conclude? We see that as the value increases, the function goes downhill and the slope becomes steeper and steeper. So, what does it tell you? It tells you that if you keep increasing the value of 𝚯1, you will keep going downhill. Great, that’s what we want, isn’t it?

Recall, we want those values of thetas for which we will land at the bottom of the 3-D graph, and going along a path that is going downhill, guarantees that!

We do the same for 𝚯0. But why? for the above image, isn’t just working on 𝚯1 enough to get you down to the valley? Well yes, in the above case it is, but not in a case like this:

Go ahead, freeze 𝚯0, to say, 0.7 above, visualize the slice, try to traverse down, and see where you reach. Do this again, but this time for 0.3, see the difference?

So, we can’t work with one variable all the time and freeze the others. That will take you down for sure, but it won’t be the bottommost part of the valley; it will only be the bottommost part for that slice. So what do we do?

We traverse down the valley simultaneously:

Here is the algorithm for updating the thetas:

\[
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
\]

(applied simultaneously for every \( j \)).

We are updating the thetas by subtracting the term \( \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \).

The factor \( \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \) is nothing but the partial derivative of the cost function with respect to one variable at a time (see the \( \theta_j \) at the bottom). It gives the slope, which has a magnitude and a direction. If it is negative, that means that increasing the current value of \( \theta_j \) takes us downhill, which is good, so we need to increase the value of \( \theta_j \); subtracting this term does increase \( \theta_j \), because minus and minus make a plus. Similarly, if the slope is positive, we are going uphill, so reducing the value of \( \theta_j \) makes sense.

But then, what’s the \( \alpha \) term? It’s called the learning rate: a constant value that decides by how much we increase or decrease the theta value at each update. Increasing its value results in faster updates, and we might reach the bottommost point faster, but not necessarily.

Look at this:

This is an example of a small learning rate, like 0.001. It results in a small and steady traversal towards the bottom point. Of course, it will take more time, but it practically guarantees that you will reach the bottommost point.

Now, what if we have a large learning rate?

If we have a very high learning rate when we update our thetas, we might *miss* the bottommost point and go uphill, and then again go back and forth, missing the bottommost point each time, thus being stuck in a loop.

So you get the idea now, that each time we update the weights, we call it an iteration. We need to keep iterating, till we reach the bottommost point.

Once we reach the bottommost point, that is, after a certain number of iterations (theta updates), the value of the cost function stops decreasing and won’t improve further.
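The whole procedure above can be sketched in a few lines of code. This toy example fits a single parameter theta on made-up data drawn from y = 2x; the learning rate and iteration count are chosen for illustration:

```python
# Toy gradient descent for a one-parameter hypothesis h(x) = theta * x,
# minimizing the squared-error cost over a small made-up dataset.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]    # y = 2x, so the optimal theta is 2

theta = 0.0
alpha = 0.01                 # learning rate
m = len(xs)

for iteration in range(1000):
    # Partial derivative of the cost (1/2m) * sum((theta*x - y)^2) w.r.t. theta
    grad = sum((theta * x - y) * x for x, y in zip(xs, ys)) / m
    theta -= alpha * grad    # descend along the slope

print(round(theta, 4))  # → 2.0
```

With two or more thetas, the loop is the same: compute every partial derivative first, then update all the thetas simultaneously.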

Now you might understand what people mean when you see them talking about “training” their machine learning models and saying “it takes time,” haha.

Also, you get why it’s called “gradient descent”. The gradient tells how steep the slope is at a point on the graph, that is, the magnitude of the slope, and descent means we are descending to the bottommost point.

That’s all I had for you to know about gradient descent. I know this might seem quite overwhelming at first. But try to visualize and you will get there.

Also, many of you might not have any background in calculus. For those, I won’t suggest you take a calculus course. Instead, I will suggest this YouTube channel, which is simply epic at explaining math: 3Blue1Brown. The visualizations this guy presents will simply blow your mind. Do make sure to check it out.

There is more to gradient descent; I can’t cover every aspect in a single post, and I also don’t want to overload you with information that is not going to immediately help you. I believe that’s a detrimental way of educating. Later, when the need arises, I will give out more on this topic.


Linear Regression in Python: The Naive Way

The post Linear Regression in Python: The Naive Way appeared first on ML-DL.

Hola mates, long time no C. Yeah, I am just a Python guy, and I don’t like C/C++ too much, simply because they aren’t that handy. A programming language is a tool: it should be easy to use and let you implement and do stuff, not waste your time configuring and reading the manuals.

So today we will see how to implement linear regression in Python. If you went through my older posts, you know it’s something the stats guys use all the time. A linear regression model is so simple that even calculators have it now. I will present two approaches: the naive one and the bowtied one. The bowtied way is for learning; the naive one is for practical purposes. But before that, we need some data. Now, there are lots of datasets available for this, but then I might have to spend time cleaning and organizing them, which would make this post kind of boring… So, we will use randomly generated data!

I will be showing you the most basic form of linear regression, that is, the function `y = m*x + c`.

Where `x` is a set of points and `y` is the corresponding value associated with that `x`. If you recall, machine learning is about finding a relation between two sets of values, and in this case, this relation is established by this function only by the values `m` and `c`. Easy peasy.

So here’s what we are gonna do, we will get some random values of `x`, and for the corresponding values of `y` we will just keep 2 times x, i.e. `2*x`. Later we can verify that one of the values i.e `m` is also 2. So let’s begin!

We import the NumPy library which stands for numerical python, it was developed in 2005 by Travis Oliphant and is used for whole hosts of numerical operations and experimentations. It’s a mathematician’s favorite.

We have used the `random` submodule of NumPy which will help us create an array of any dimension that we want. The `np.random.rand` function takes in a variable number of arguments as dimensions of the output that we want. Think of this as a vector, it needs to have dimensions.

We want at least 1000 values of x. Since we wanted a series of values of `x` we will simply put 1000 in the argument.

Next, we generate `y`, as promised it will be 2 times `x`. So here it is:

So, what next? We have our data, the `x` and the `y`. We will split this data into training data: the data we will use to train the model and the testing data: the data we will use to test the model to gauge its accuracy. Since we have 1000 data points, we will use 90% of this data for training and keep aside 10% for testing. So let’s go!

So, we have 900 data points both for `x_train` and `y_train` and 100 data points for both `x_test` and `y_test`, the names are self-explanatory. I have done something called `list-slicing` or in this case, `array-slicing` to be precise, what that is: you will have to explore that yourself buddy, not everything is going to be spoon-fed.
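The data generation and split described above can be sketched as follows (a reconstruction, not the post’s original snippet):

```python
import numpy as np

# 1000 random values of x, and y defined as exactly 2 * x
x = np.random.rand(1000)
y = 2 * x

# Array slicing: first 900 points for training, last 100 for testing
x_train, x_test = x[:900], x[900:]
y_train, y_test = y[:900], y[900:]

print(x_train.shape, x_test.shape)  # (900,) (100,)
```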

So, we have our training data ready, our testing data ready. We need to just create the model. Since this article will focus on the naive a.k.a the easy way, we will use a library to import the pre-built model and then simply train this model with the data. The library that I am going to use is called Scikit-Learn and is a very popular library used for statistical and machine-learning problems.

We import the LinearRegression class and instantiate an object of this class named naive_model, which we will use to fit the data. To train the model, we will simply call the `fit()` method of this class object with `x_train` and `y_train` passed to it. However, before doing that, we need to change the format of our data. The `fit()` method expects `x` and `y` to be of shape (n_samples, n_features), i.e. (number of samples, number of features). Since our data has just one feature (we only have one `x`), the shape of `x` is (1000,). We need to change this to (1000, 1), which means 1000 samples and 1 feature. So, we do a `reshape(-1, 1)`. The diagram below shows this:

We now fit and train our model.
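Putting the reshape and the fit together, the step probably looks something like this (a sketch; `naive_model` is the name used in the text, everything else follows the steps above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)  # hypothetical seed, for reproducibility
x = np.random.rand(1000)
y = 2 * x
x_train, y_train = x[:900], y[:900]

naive_model = LinearRegression()
# fit() expects X of shape (n_samples, n_features), so reshape to a column.
naive_model.fit(x_train.reshape(-1, 1), y_train)
```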

That’s it, this simple step trains our model. Now let’s see how the model predicts the data. We will call the `predict()` method and compare the output with one of the test data points.

We predict the value for the 45th data point in `x`, and the output we get is 1.64366699. This means the model predicts that the value of `y` for the 45th value of `x` is 1.6436. Notice that I had to call `reshape(-1, 1)` before feeding in the value.

Now let’s see what the actual value of `y` is for this 45th data point.
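A sketch of the comparison, assuming the "45th data point" means index 45 of the test split (the exact number printed depends on the random data, so I haven't hard-coded the 1.6436 from the original run):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)  # hypothetical seed, for reproducibility
x = np.random.rand(1000)
y = 2 * x
x_train, y_train = x[:900], y[:900]
x_test, y_test = x[900:], y[900:]

naive_model = LinearRegression().fit(x_train.reshape(-1, 1), y_train)

# reshape(-1, 1) again, because predict() also expects a 2-D input.
predicted = naive_model.predict(x_test[45].reshape(-1, 1))[0]
actual = y_test[45]
print(predicted, actual)  # the two should match, since y = 2x exactly
```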

Hoo-yah! It's the same! The model predicted this value with 100% accuracy. Now, doing this for every data point would be tiresome, wouldn't it? So we introduce metrics. Metrics help you assess how good your model is; a very simple one is the mean squared error, MSE for short. It calculates the difference between the model's output and the expected output, squares it, does this for all the points, and then takes the average. So, what would a good MSE score for a model be? Think, anon. Since we want our model to be good, the difference between predicted and expected values should be close to zero (ideally zero), and hence the average should also be close to zero (ideally zero). Let's see the MSE of our model.

That's a very small number; don't forget the `e-31`, which means it is multiplied by 10^-31. I had better convert this to an integer.
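The MSE check might be done like this (a sketch using scikit-learn's `mean_squared_error`, though computing it by hand with NumPy would work just as well):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

np.random.seed(0)  # hypothetical seed, for reproducibility
x = np.random.rand(1000)
y = 2 * x
x_train, y_train = x[:900], y[:900]
x_test, y_test = x[900:], y[900:]

naive_model = LinearRegression().fit(x_train.reshape(-1, 1), y_train)

# Average of squared differences between predictions and true test values.
mse = mean_squared_error(y_test, naive_model.predict(x_test.reshape(-1, 1)))
print(mse)       # a tiny number, on the order of 1e-31
print(int(mse))  # 0
```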

Hoo-yah, it's 0! So our model is 100% accurate. But you should be worried if this happens on real-world data, because it is rare to find data with such a simple relation. Real-world data is all complex and mushy-mushy.

Let's see the values of `m` and `c` that we need for the linear equation. We inspect the `coef_` attribute of the model, where scikit-learn stores the coefficients.

It's 2, as expected, since we got our `y` as `2*x`. Also, notice we don't have a `c`; it is 0 (scikit-learn stores the intercept separately, in the `intercept_` attribute). This means the line passes through (0,0) on the graph. Let's verify this by plotting it.
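Inspecting both the slope and the intercept might look like this (a sketch; as noted, `coef_` holds `m` and `intercept_` holds `c`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)  # hypothetical seed, for reproducibility
x = np.random.rand(1000)
y = 2 * x

naive_model = LinearRegression().fit(x[:900].reshape(-1, 1), y[:900])

print(naive_model.coef_)       # slope m, approximately [2.]
print(naive_model.intercept_)  # intercept c, approximately 0.0
```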

Hoo-yah! Matplotlib is a library we can use to plot graphs and whatnot. I plotted the point (0,0) separately to show that the line crosses the origin.
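A sketch of such a plot (the `Agg` backend and the output filename are my additions so the script runs headlessly; the original presumably displayed the figure inline):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; saves to a file instead of a window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)  # hypothetical seed, for reproducibility
x = np.random.rand(1000)
y = 2 * x
naive_model = LinearRegression().fit(x[:900].reshape(-1, 1), y[:900])

x_test, y_test = x[900:], y[900:]
plt.scatter(x_test, y_test, s=8, label="test data")
plt.plot(x_test, naive_model.predict(x_test.reshape(-1, 1)),
         color="red", label="fitted line")
plt.scatter([0], [0], color="green", zorder=3, label="origin (0, 0)")
plt.legend()
plt.savefig("naive_fit.png")
```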

So that's it, guys, that's all for the naive implementation; this is what you will use for practical purposes. However, to gain an in-depth understanding of the algorithm, we will see how to implement all of this from scratch using NumPy and Python. Till then, stay tuned.

The post Linear Regression in Python: The Naive Way appeared first on ML-DL.


In this post, we will look at a very simple machine learning algorithm, which is the machine-learning equivalent of a programming language's “hello world”.

So what is linear regression? If you are from a statistics background, chances are you already know what it means. In statistics, linear regression is a tool used for finding a relation between a dependent variable and an independent variable. It's that simple. Consider the equation below.
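The equation itself isn't reproduced here; based on the discussion of *m*, *c*, and the hypothesis subscript *h* that follows, it is the standard form:

```latex
y_h = mx + c
```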

This is a linear equation that captures the relationship between the variables *y* and *x*. The variable *y* and the variable *x* are called the dependent and independent variables, respectively. The subscript *h* means it is our hypothesis function.

If you are given a set of data points and you think they are related linearly, that is, there exists some equation *y = mx + c* that represents the relationship between the dependent and independent variables, if not exactly then approximately, you would apply the linear regression algorithm and see whether it is a good fit.

In the above picture, there is a set of data points, and we have fit a line through them. Notice that the line does not fit all the points; in fact, it fits only a very small number of them. However, it captures the **trend** of the data, which is somewhat linear.

Now how do we find this line? More mathematically, how do we find the parameters *m* and *c* of the linear equation we discussed above?

It's simple. We choose a random line and calculate the sum of the squared distances of the points from this line. We then try to minimize this sum, i.e. we find the parameters for which this sum is lowest. Why does the sum have to be minimal for the best parameters? For that, you must first ask what the best parameters are.

The best parameters are the ones for which 1. we get a line that covers as many points as possible, and 2. the points it does not cover are as close to the line as possible.

If you think about the above two points, you will realize that both requirements are satisfied when the sum of the squared distances is minimal.

In particular, we will try to minimize the average of the sum of squared distances. We take the average so that our model doesn’t depend on the number of data points.

There are some geometric and linear-algebra reasons why we use squared distances instead of absolute distances. We won't cover them in much detail here; if you are interested, you can refer to them here.

We write the above-mentioned metric in the form of a function and call it the loss function. See this:
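The function isn't reproduced here; written out in the standard mean-squared-error form (with *n* data points and hypothesis *y_h = mx + c*, matching the averaging of squared distances described above, though the post's exact notation may differ):

```latex
L(m, c) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (m x_i + c) \big)^2
```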

Now the goal of linear regression is to make the output of this loss function as small as possible, i.e. we want to minimize this function.

If everything makes sense here, let’s move to the process of minimizing this function.

Now, if you have some experience in maths, you might know that there exist analytical solutions, which aim at transforming the given problem into a well-known form and then calculating its solution directly.

But we won't be doing that here, because it leads to matrix inversion operations that are computationally expensive in the case of multivariate linear regression, i.e. regression in which we have more than one *x*.
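For reference, the analytical route is the well-known normal equation, whose matrix inversion is exactly the expensive step mentioned (here *X* is the matrix of inputs with a column of ones for the intercept, and θ collects the parameters *m* and *c*):

```latex
\theta = (X^{\top} X)^{-1} X^{\top} y
```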

In the next article, we will see how to solve this problem.

The post Linear Regression appeared first on ML-DL.
