Imagine you love to read books but don’t have a friend to suggest one. How do you find books that match your taste? One way is to Google around, read descriptions and reviews, and pick one. If you have ever done this, you know how laborious it is! That is what life was like before recommendation engines arrived. It feels magical that, based on what you have bought, short-listed and viewed in the past, Amazon can predict the items you would like to buy. It can also find users with a taste like yours and use them to make suggestions. Isn’t that just like getting suggestions from a friend with similar likings?
Before we discuss how these algorithms work, we need to understand the underlying principles. Once those are clear, you will have an intuitive feel for these algorithms and can build one of your own. Let’s start by discussing how to find a document that is similar to another document.
As you know, computers cannot understand words. To them, a word like ‘dog’ is just a string of 3 characters. They are oblivious to the fact that it’s a 4-legged animal often domesticated for its loyalty and friendly nature. In the coming articles, we will discuss how we can associate words and phrases like ‘animal’, ‘4-legged’ and ‘loyalty’ with the word ‘dog’ using word2vec. For now, we will just convert the words to integers, because that is what computers understand primarily.
Let’s say we have 4 sentences:
- I am doing fine.
- I am doing great.
- My life sucks.
- I am doing fine. I am doing fine. I am doing fine. I am doing fine.
Let’s represent each sentence simply by the number of times each word occurs in it.
| | I | am | doing | fine | great | My | life | sucks |
|---|---|---|---|---|---|---|---|---|
| s1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| s2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| s3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| s4 | 4 | 4 | 4 | 4 | 0 | 0 | 0 | 0 |
Vectorised sentences:
s1 = (1,1,1,1,0,0,0,0)
s2 = (1,1,1,0,1,0,0,0)
s3 = (0,0,0,0,0,1,1,1)
s4 = (4,4,4,4,0,0,0,0)
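As a minimal sketch, these count vectors can be built with plain Python (the `vocab` order and the `vectorise` helper are my own naming, matching the table above):

```python
# Bag-of-words count vectors for the four sentences.
sentences = [
    "I am doing fine",
    "I am doing great",
    "My life sucks",
    "I am doing fine I am doing fine I am doing fine I am doing fine",
]

# Fixed vocabulary in the same order as the table columns.
vocab = ["I", "am", "doing", "fine", "great", "My", "life", "sucks"]

def vectorise(sentence):
    # Count how often each vocabulary word appears in the sentence.
    words = sentence.split()
    return [words.count(w) for w in vocab]

vectors = [vectorise(s) for s in sentences]
print(vectors[0])  # [1, 1, 1, 1, 0, 0, 0, 0]
print(vectors[3])  # [4, 4, 4, 4, 0, 0, 0, 0]
```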
Okay, so far we have boiled things down: sentences -> words -> numbers -> vectors. What do the numbers in the vectors signify? We dealt with 3-dimensional space in high school, where we represented points in 3-d space by (x, y, z). Here we are using 8 dimensions. The dimensions denote the features we are dealing with, and each value tells us the strength of that feature in the document. I will use the words text and document interchangeably as required.
Now, we will find the similarity between these vectors using 2 methods.
Euclidean distance
Since we are representing each document by a vector in this 8-dimensional space, we can calculate the distance between these vectors (documents) using the Euclidean distance. The greater the distance, the greater the dissimilarity.
| | s1 | s2 | s3 | s4 |
|---|---|---|---|---|
| s1 | 0 | 1.4 | 2.6 | 6 |
| s2 | 1.4 | 0 | 2.6 | 6.6 |
| s3 | 2.6 | 2.6 | 0 | 8.2 |
| s4 | 6 | 6.6 | 8.2 | 0 |
- I am doing fine.
- I am doing great.
- My life sucks.
- I am doing fine. I am doing fine. I am doing fine. I am doing fine.
From the sentences we can see that sentences 1, 2 and 4 convey the same meaning, and all three are different from the 3rd sentence. Accordingly, in the row of s3 all the values are high, except the 0, which is its distance from itself. But euclidean(s1, s4) = 6 is almost as high as euclidean(s3, s4) = 8.2, even though s4 is just s1 repeated 4 times and is fundamentally the same sentence (maybe a little exaggerated). A high euclidean(s1, s4) would suggest the two are very different, which is not the case.
This shows that Euclidean distance doesn’t really measure similarity of content. It just accumulates the differences in word counts, dimension by dimension, including for words the documents don’t share.
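The Euclidean numbers in the table can be reproduced with the standard library alone; here is a minimal sketch (the vector names follow the article):

```python
import math

# The four sentence vectors from the table above.
s1 = (1, 1, 1, 1, 0, 0, 0, 0)
s2 = (1, 1, 1, 0, 1, 0, 0, 0)
s3 = (0, 0, 0, 0, 0, 1, 1, 1)
s4 = (4, 4, 4, 4, 0, 0, 0, 0)

def euclidean(a, b):
    # Square root of the summed squared differences, one term per dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(round(euclidean(s1, s2), 1))  # 1.4
print(round(euclidean(s1, s3), 1))  # 2.6
print(round(euclidean(s1, s4), 1))  # 6.0
```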
Cosine similarity
I am sure you have dealt with finding the cosine of the angle between 2 vectors by taking the dot product of their unit vectors. A unit vector has length 1 and captures only the direction of a vector. So, indirectly, it tells us which concepts are present in the document, not how strongly they are present. One note on the table below: following SciPy’s convention, it shows the cosine *distance*, which is 1 minus the cosine similarity, so 0 means identical direction and 1 means nothing in common.
| | s1 | s2 | s3 | s4 |
|---|---|---|---|---|
| s1 | 0 | 0.25 | 1 | 0 |
| s2 | 0.25 | 0 | 1 | 0.25 |
| s3 | 1 | 1 | 0 | 1 |
| s4 | 0 | 0.25 | 1 | 0 |
- I am doing fine.
- I am doing great.
- My life sucks.
- I am doing fine. I am doing fine. I am doing fine. I am doing fine.
Applying the same comparison again: the row of s3 should have the highest values, which is the case this time. Also, cosine(s1, s4) is equal to 0, meaning the two are identical in direction. Although s1 – ‘I am doing fine’ – and s2 – ‘I am doing great’ – also mean the same thing, our algorithm considers ‘fine’ and ‘great’ as different words because it doesn’t know how language works. Still, cosine(s1, s2) is only 0.25, which is a fair degree of similarity.
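A minimal sketch of the same calculation, reporting cosine distance (1 minus similarity) to match the table (the function name is my own):

```python
import math

s1 = (1, 1, 1, 1, 0, 0, 0, 0)
s2 = (1, 1, 1, 0, 1, 0, 0, 0)
s4 = (4, 4, 4, 4, 0, 0, 0, 0)

def cosine_distance(a, b):
    # 1 - (a . b) / (|a| * |b|): the dot product only picks up
    # dimensions that are non-zero in BOTH vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance(s1, s2))  # 0.25
print(cosine_distance(s1, s4))  # 0.0
```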
Intuition
One intuitive way to think about this idea is to consider the 2 components of a vector: direction and magnitude. Direction is the “preference” / “style” / “sentiment” of the vector, while the magnitude is how strong it is towards that direction. Cosine similarity is calculated using only the dot product and magnitude of each vector, and is therefore affected only by the terms the two vectors have in common, whereas Euclidean has a term for every dimension which is non-zero in either vector. Cosine thus has some meaningful semantics for ranking similar documents, based on mutual term frequency, whereas Euclidean does not.
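To make the direction/magnitude split concrete, here is a small sketch (with a hypothetical `unit` helper): normalising s1 and s4 to unit length yields exactly the same vector, which is why their cosine distance is 0 even though their Euclidean distance is 6.

```python
import math

def unit(v):
    # Divide each coordinate by the vector's length, keeping only direction.
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

s1 = (1, 1, 1, 1, 0, 0, 0, 0)
s4 = (4, 4, 4, 4, 0, 0, 0, 0)

print(unit(s1))              # (0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0)
print(unit(s1) == unit(s4))  # True: same direction, different magnitude
```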
That’s it for now! We will discuss how to combine the concepts of word frequencies and cosine similarity in our next part with advanced algorithms like TF-IDF and BM25.
Let us know your feedback by commenting below.
An easy way to calculate these values is with the SciPy library in Python. Note that `scipy.spatial.distance.cosine` returns the cosine distance (1 minus the similarity), which is the convention used in the table above.

from scipy.spatial.distance import euclidean, cosine
s1 = (1, 1, 1, 1, 0, 0, 0, 0)
s2 = (1, 1, 1, 0, 1, 0, 0, 0)
euclidean(s1, s2)  # 1.414...
cosine(s1, s2)  # 0.25 (cosine distance, not similarity)
An AI evangelist and a multi-disciplinary engineer. Loves to read business and psychology during leisure time. Connect with him any time on LinkedIn for a quick chat on AI!