Similarity learning is an unsupervised machine learning (ML) method in which the model “learns” the structure of the data based on the similarity or dissimilarity of elements in the dataset. It is a powerful tool in the context of vector embeddings generated from unstructured data (e.g., video, images, audio, prose).
Unstructured data frequently comes without labels because the cost of having people apply them is prohibitive. But with similarity learning we can still perform a range of useful operations on a dataset without labels, or even without a preconceived notion of the data’s distribution.
Consider recommendation problems. You are trying to recommend something of interest to a user or customer. It might be a song, an article, a piece of clothing, a movie and so on. If you have some sense of the user’s prior habits, you can use similarity learning to attack this problem even if you don’t have labeled data.
In this article, we are going to provide an introduction to similarity learning in the context of vector embeddings and how you can combine these concepts to solve problems.
Similarity learning in practice
Our starting point for similarity learning is the generation of vector embeddings. We won’t cover how to generate vector embeddings here, other than to say that once you have them, you have an n-dimensional numerical object on which you can perform distance calculations.
These distance calculations can provide you with semantic insight into your data. To expand on the recommendation problem from earlier, let’s assume you have a dataset composed of the vector embeddings of various movies. Let’s also assume that you have a user’s viewing history. By comparing the distance between a user’s movies and other movies in your dataset, you can gain insight into the similarity between movies the user has watched, and movies they can choose from. This enables you to make a helpful recommendation.
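As a rough illustration of the movie example, the sketch below ranks unwatched titles by their cosine similarity to the average of a user's watched titles. The movie names, the 3-dimensional embeddings, and the `recommend` helper are all made up for this example; real embeddings would come from a model and have far more dimensions, and averaging the viewing history is only one possible way to summarize it.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
catalog = {
    "Space Saga": [0.9, 0.1, 0.0],
    "Galaxy Wars": [0.8, 0.2, 0.1],
    "Love in Paris": [0.1, 0.9, 0.2],
    "Baking Show": [0.0, 0.2, 0.9],
}
watched = ["Space Saga"]

def recommend(catalog, watched, k=2):
    """Return the k unwatched titles closest to the user's average embedding."""
    dims = len(next(iter(catalog.values())))
    # Average the watched embeddings into a single "taste" vector.
    profile = [
        sum(catalog[title][i] for title in watched) / len(watched)
        for i in range(dims)
    ]
    candidates = [title for title in catalog if title not in watched]
    ranked = sorted(
        candidates,
        key=lambda title: cosine_similarity(profile, catalog[title]),
        reverse=True,  # most similar first
    )
    return ranked[:k]

recommendations = recommend(catalog, watched)
print(recommendations)
```

With these toy vectors, the title pointing in nearly the same direction as the watched movie is ranked first.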
There are various ways of determining similarity between vector embeddings, including Euclidean distance, cosine similarity and Jaccard similarity. For the remainder of this article we will focus on similarity learning using cosine similarity.
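To make the three measures concrete, here is a minimal sketch of each in plain Python. The function names and sample inputs are chosen for this example only. Note that Jaccard similarity is defined on sets rather than on real-valued vectors, so it is shown here on hypothetical tag sets.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(s, t):
    """Size of the intersection over size of the union (sets must not both be empty)."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

dist = euclidean_distance(a, b)   # non-zero: the points are far apart
cos = cosine_similarity(a, b)     # approximately 1: the directions match
jac = jaccard_similarity({"action", "sci-fi", "space"},
                         {"sci-fi", "space", "drama"})  # 2 shared of 4 total
print(dist, cos, jac)
```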
Determining similarity using cosine distance
Cosine similarity, from which cosine distance is derived (distance = 1 - similarity), is a measure of how alike two vector embeddings are based on the angle between them. A value of 1 indicates that the vectors are perfectly similar (i.e., point in the same direction), a value of 0 indicates that the vectors are unrelated (i.e., orthogonal), and a value of -1 indicates that the vectors are perfectly dissimilar (i.e., point in opposite directions). The diagram below helps illustrate this.
To calculate the cosine similarity between two vectors, we use the vectors’ coordinates. Suppose we have two vectors, A and B, with the following coordinates:
A = (a1, a2, a3, ..., an)
B = (b1, b2, b3, ..., bn)
Notice here that we are not bound to a particular number of dimensions. We can use cosine similarity in n-dimensional space, meaning that it will enable us to generate a similarity score even for high-dimensional vector embeddings.
Once we have our coordinates, we next calculate the dot product of the vectors, followed by their magnitudes:
dot product = (a1 * b1) + (a2 * b2) + (a3 * b3) + ... + (an * bn)
magnitude of A = sqrt(a1^2 + a2^2 + a3^2 + ... + an^2)
magnitude of B = sqrt(b1^2 + b2^2 + b3^2 + ... + bn^2)
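The dot product and magnitude formulas translate almost directly into Python. The following is one possible sketch using only the standard library; the helper names `dot_product` and `magnitude` and the sample vectors are assumptions made for this example.

```python
import math

def dot_product(a, b):
    """Sum of the pairwise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    """Euclidean length of a vector: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))

A = [3.0, 4.0]
B = [4.0, 3.0]

dot = dot_product(A, B)  # (3 * 4) + (4 * 3)
mag_a = magnitude(A)     # sqrt(3^2 + 4^2)
mag_b = magnitude(B)     # sqrt(4^2 + 3^2)
print(dot, mag_a, mag_b)
```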
Finally, we use these values to return the cosine similarity:
cosine similarity = dot product / (magnitude of A * magnitude of B)
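Putting the pieces together, a minimal Python sketch of the full calculation might look like the following. It assumes both vectors have the same length and that neither is all zeros; the sample vectors are chosen so that the three reference values (1, 0 and -1) are easy to check by hand.

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # right angle
opposite = cosine_similarity([1.0, 0.0], [-4.0, 0.0])        # opposite directions

# Cosine distance is conventionally defined as 1 minus the similarity.
distance_same = 1.0 - same_direction

print(same_direction, orthogonal, opposite, distance_same)
```

In production code you would more likely rely on a vectorized numerical library than on plain Python loops, but the arithmetic is the same.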
Considerations when using cosine similarity
While cosine similarity is a powerful tool for solving various problems related to vector embeddings, caution needs to be used with this technique, like any other ML method. A starting point for this discussion is the use cases for which cosine similarity is best suited.
Mathematically, cosine similarity is most appropriate when we are focused on the direction of the vectors we are analyzing rather than their magnitude. What this means in practice is that cosine similarity is insensitive to scale.
Imagine that you are comparing the ratings that diners give a restaurant after eating there. If you convert these ratings to vectors and use cosine similarity to compare them, the resulting similarity score will be unaffected by the scale used (e.g., 5 stars, 1 to 10), provided the scales differ only by a constant multiplier.
If you were to use Euclidean distance in this situation, the resulting distance would be sensitive to the scale of the ratings, since the distance would be influenced by the magnitude of the vectors.
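The difference can be checked with a small experiment. In the sketch below, the ratings are invented, and the 10-point scale is assumed to be a simple doubling of the 5-star scale. Under that assumption the cosine similarity stays the same while the Euclidean distance grows. A conversion that also shifts the zero point would change the cosine as well.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical ratings for three restaurants on a 5-star scale.
diner_a = [5.0, 3.0, 4.0]
diner_b = [4.0, 2.0, 5.0]
# The same ratings from diner B expressed on a 10-point scale (assumed: doubled).
diner_b_10 = [2 * r for r in diner_b]

cos_same_scale = cosine_similarity(diner_a, diner_b)
cos_mixed_scale = cosine_similarity(diner_a, diner_b_10)
euc_same_scale = euclidean_distance(diner_a, diner_b)
euc_mixed_scale = euclidean_distance(diner_a, diner_b_10)

print(cos_same_scale, cos_mixed_scale)  # effectively equal
print(euc_same_scale, euc_mixed_scale)  # the second is noticeably larger
```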
Another consideration is dimensionality. If you are working with very high-dimensional vectors, cosine similarity may not be an effective measure of similarity. This is because many of those dimensions may capture noise or be unrelated to the underlying meaning you are trying to identify.
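A synthetic experiment gives a feel for this effect. The sketch below is an illustration under artificial assumptions, not a benchmark: two vectors share an identical 5-dimensional "signal", and each is then padded with independent Gaussian noise dimensions. The seed, the signal values, and the `with_noise` helper are all arbitrary choices for this example.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

rng = random.Random(0)  # fixed seed so the run is repeatable
signal = [1.0] * 5      # the shared, meaningful part of both vectors

def with_noise(n_noise):
    """Return the signal followed by n_noise independent Gaussian noise values."""
    return signal + [rng.gauss(0.0, 1.0) for _ in range(n_noise)]

# No noise dimensions: the vectors are identical, so similarity is about 1.
sim_clean = cosine_similarity(with_noise(0), with_noise(0))
# 1000 noise dimensions per vector: the shared signal is swamped.
sim_noisy = cosine_similarity(with_noise(1000), with_noise(1000))

print(sim_clean, sim_noisy)
```

Even though both vectors still contain exactly the same signal, the score for the padded pair falls far below 1 because the noise dominates both the dot product and the magnitudes.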
Furthermore, dot product and magnitude calculations are affected by outliers. Cosine similarity may therefore deliver skewed results if you have outliers present.
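The effect of a single outlier is easy to reproduce. In the sketch below, the vectors are invented for illustration: `b` is a slightly perturbed copy of `a`, and `b_outlier` is the same vector with one component replaced by an extreme value.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

a = [1.0, 2.0, 3.0, 4.0]
b = [1.1, 1.9, 3.2, 3.9]            # close to a in every component
b_outlier = [100.0, 1.9, 3.2, 3.9]  # same as b, but the first value is extreme

sim_before = cosine_similarity(a, b)
sim_after = cosine_similarity(a, b_outlier)

print(sim_before, sim_after)
```

The outlier inflates the magnitude of `b_outlier` and pulls its direction toward the first axis, so the similarity drops sharply even though three of the four components are unchanged.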
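Other issues for cosine similarity are matters of good hygiene: your vectors should be of the same length (i.e., have the same number of dimensions), and neither should be all zeros, because a zero vector has zero magnitude and leaves the cosine undefined.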
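These checks can be built into the function itself. The following is one possible sketch; the name `safe_cosine_similarity` and the choice to raise `ValueError` are assumptions made for this example, and other designs (such as returning 0 for a zero vector) are equally defensible depending on the application.

```python
import math

def safe_cosine_similarity(a, b):
    """Cosine similarity with explicit checks on the inputs."""
    if len(a) != len(b):
        raise ValueError(
            f"vectors must have the same length, got {len(a)} and {len(b)}"
        )
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    if mag_a == 0.0 or mag_b == 0.0:
        # A zero vector has no direction, so the angle is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (mag_a * mag_b)

valid = safe_cosine_similarity([1.0, 0.0], [3.0, 0.0])
print(valid)
```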
In closing, cosine similarity and other similarity learning techniques can be especially powerful tools when deployed alongside vector embeddings. They provide a mathematical way to capture insights into your unstructured data without any labeling or supervision, which makes it much more economical to solve problems such as recommendation using large amounts of unstructured data.