Content-Based Recommender System
Content-based filtering is a type of recommender system that uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.
Content-based Filtering Advantages:
- Making suggestions doesn’t require any data about other users.
- Recommendations tend to be highly relevant to the user’s own tastes.
- Systems for content-based filtering are typically simpler to develop.
Content-based Filtering Disadvantages:
- Recommendations lack variety and novelty, since they stay close to what the user already likes.
- Scalability is difficult.
- The attributes might be wrong or inconsistent.
Black Panther: Wakanda Forever tells the story of Queen Ramonda, Shuri, M’Baku, Okoye and Dora Milaje, who are ready to fight against the world powers that threaten their nation. After King T’Challa’s death, world powers emerge trying to interfere with them. Determined to defend their nation against this threat, Queen Ramonda, Shuri, M’Baku, Okoye and Dora Milaje are ready to fight. Moving on to the next chapter of their lives, the people of Wakanda struggle to embrace it. Meanwhile, with the help of War Dog Nakia and Everett Ross, the heroes try to chart a new path for the Kingdom of Wakanda.
To recommend a movie to someone who liked “Black Panther: Wakanda Forever” using content-based filtering, we first need to turn the description above into a mathematical form. We can then compare it with the descriptions of other films and find the most similar ones.
Representing text mathematically means making it measurable. If texts can be represented as vectors, mathematical operations can be performed on them. There are two methods for expressing text mathematically in content-based filtering:
1. Count Vector
- Documents are placed in the rows and each unique term in the columns. (A term is a word. I will use sample sentences as documents in this study, but documents can be many things, such as movie commentary, song commentary, tweets, etc.)
- The frequency of each term in each document is placed in the corresponding cell.
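To make the count vector idea concrete, here is a minimal sketch in plain Python. The three sample sentences are made up for illustration, and splitting on whitespace is a simplifying assumption; scikit-learn's CountVectorizer builds the same kind of matrix with more careful tokenization.

```python
from collections import Counter

# Hypothetical sample documents, made up for illustration
documents = [
    "the hero defends the kingdom",
    "the kingdom needs a hero",
    "a detective hunts a thief",
]

# Split each document into lowercase terms (naive whitespace tokenization)
tokenized = [doc.lower().split() for doc in documents]

# Columns: every unique term across all documents, in sorted order
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Rows: one count vector per document
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row has one entry per vocabulary term, so the word "the" contributes a 2 to the first document's row and a 0 to the third.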
2. TF-IDF
- Count Vectorizer calculation: the frequency of each word in each document
- TF (Term Frequency) calculation: frequency of term t in the relevant document / total number of terms in the document
- IDF (Inverse Document Frequency) calculation: 1 + ln((total number of documents + 1) / (number of documents containing term t + 1))
- TF-IDF calculation: TF * IDF
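The calculation above can be sketched by hand as follows. The tokenized documents and the function names are hypothetical, chosen only for this illustration, and the final score multiplies TF by IDF as in the standard definition. Note that scikit-learn's TfidfVectorizer uses raw counts for TF and, by default, L2-normalizes each row, so its absolute values will differ from these.

```python
import math

# Hypothetical tokenized documents, made up for illustration
documents = [
    ["hero", "defends", "kingdom"],
    ["kingdom", "needs", "hero"],
    ["detective", "hunts", "thief"],
]

def term_frequency(term, tokens):
    # Occurrences of the term divided by the total number of terms in the document
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, docs):
    # Smoothed IDF: 1 + ln((N + 1) / (df + 1))
    n_docs = len(docs)
    doc_freq = sum(1 for tokens in docs if term in tokens)
    return 1 + math.log((n_docs + 1) / (doc_freq + 1))

def tf_idf(term, tokens, docs):
    # Weight of a term in one document
    return term_frequency(term, tokens) * inverse_document_frequency(term, docs)

shared = tf_idf("hero", documents[0], documents)    # "hero" appears in 2 of 3 documents
rare = tf_idf("defends", documents[0], documents)   # "defends" appears in 1 of 3 documents
print(shared, rare)
```

Both terms occur once in the first document, but the rarer term gets the higher weight, which is the point of the IDF factor.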
After expressing our texts as vectors, we can decide what to suggest by calculating the similarity or distance between them. I will cover the two methods we need to know.
1. Euclidean Distance
Euclidean distance is the straight-line distance between two vectors: the square root of the sum of the squared differences between their components, the same formula we learned in school. It is widely used in Natural Language Processing tasks.
As the Euclidean distance decreases, the similarity between the two items we compare increases. In the table below, we can see the word counts in the descriptions of some movies.
To find similarity between films, we find Euclidean distances. As we can see, movie 3 and movie m are quite similar to each other.
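As a quick sketch of that calculation, the snippet below compares word-count vectors for three movies. The vectors and the function name are made up for illustration and are not taken from the table.

```python
import math

# Hypothetical word-count vectors for three movie descriptions
movie_a = [3, 0, 1, 2]
movie_b = [2, 0, 1, 3]
movie_c = [0, 4, 0, 0]

def euclidean_distance(u, v):
    # Square root of the sum of squared component differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

d_ab = euclidean_distance(movie_a, movie_b)
d_ac = euclidean_distance(movie_a, movie_c)
print(d_ab, d_ac)
```

The distance between movie_a and movie_b is much smaller than the distance between movie_a and movie_c, so movie_b would be the better suggestion for someone who liked movie_a.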
2. Cosine Similarity
Cosine similarity measures how similar two vectors are by the cosine of the angle between them: their dot product divided by the product of their lengths.
Because only the angle matters, cosine similarity gives a useful measure of how similar two documents are likely to be in terms of their subject matter, independently of the length of the documents.
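Here is a minimal sketch of the calculation, written by hand so the length-independence is easy to see. The vectors and the function name cosine_similarity_vec are hypothetical, and returning 0.0 for an all-zero vector is my own assumption for handling empty documents.

```python
import math

def cosine_similarity_vec(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same proportions, ten times longer
other_doc = [0, 0, 5]    # shares no terms with the others

same_topic = cosine_similarity_vec(short_doc, long_doc)
different = cosine_similarity_vec(short_doc, other_doc)
print(same_topic, different)
```

The short and long documents point in the same direction, so their similarity is 1 (up to floating-point rounding) even though their Euclidean distance is large. Documents with no terms in common score 0.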
Let’s make an example right away without being too theoretical!
The dataset used in this study can be found here.
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
df = pd.read_csv(r"C:\Users\cilil\Desktop\veri_bilimi\recommandation_systems\datasets\the_movies_dataset\movies_metadata.csv", low_memory=False) # To turn off DtypeWarning
df.head()
df.columns
When we examine the columns of our dataset, we see the overview column, which contains descriptions of the movies. We will use this column for content-based filtering.
df["overview"].head()
First, we compute the pairwise similarities using cosine_similarity.
def calculate_cosine_sim(dataframe):
    # Drop English stop words (and, in, on, etc.): frequent words that carry little meaning
    tfidf = TfidfVectorizer(stop_words='english')
    # Replace missing overviews with empty strings so the vectorizer can process them
    dataframe['overview'] = dataframe['overview'].fillna('')
    tfidf_matrix = tfidf.fit_transform(dataframe['overview'])
    # Pairwise cosine similarity between all movie overviews
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    return cosine_sim
Then we write our function that will make suggestions.
def content_based_recommender(title, cosine_sim, dataframe):
    # Map each title to its row index
    indices = pd.Series(dataframe.index, index=dataframe['title'])
    # For duplicated titles, keep only the last occurrence
    indices = indices[~indices.index.duplicated(keep='last')]
    # Look up the index of the requested title
    movie_index = indices[title]
    # Similarity scores between the requested movie and every movie in the dataset
    similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=["score"])
    # Skip the first result (the movie itself) and take the next 10 most similar
    movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index
    return dataframe['title'].iloc[movie_indices]
Now let’s try our functions and find movies similar to “The Dark Knight Rises”.
cosine_sim = calculate_cosine_sim(df)
content_based_recommender('The Dark Knight Rises', cosine_sim, df)
Output:
Now we know which similar movies we can suggest to someone who likes “The Dark Knight Rises” :)
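If you want to try the same idea without downloading the dataset, here is a small sketch on a toy DataFrame. The titles and overviews are made up for illustration, and the ranking step is a simplified variant that drops the query title explicitly instead of slicing off the first result.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and overviews, made up for illustration
toy_df = pd.DataFrame({
    "title": ["Hero Rising", "Hero Returns", "Garden Tales"],
    "overview": [
        "A masked hero protects the city from a criminal gang",
        "The masked hero returns to protect the city once more",
        "A quiet gardener grows roses in a small village",
    ],
})

# Vectorize the overviews and compute all pairwise similarities
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(toy_df["overview"].fillna(""))
cosine_sim = cosine_similarity(tfidf_matrix)

# Find the row of the query title, then rank every other title by similarity
query_index = toy_df.index[toy_df["title"] == "Hero Rising"][0]
scores = pd.Series(cosine_sim[query_index], index=toy_df["title"])
ranked = scores.drop("Hero Rising").sort_values(ascending=False)
print(ranked)
```

The two hero movies share several terms, so "Hero Returns" ranks first, while "Garden Tales" shares no terms with the query and scores 0.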
I hope this study was helpful, see you soon!