YFret provides recommender systems that can be used in industries such as e-commerce, travel and content sites. In our series about recommender systems, the previous post explored the role of different types of recommendation in the e-commerce lifecycle. In this post, we will look at the business need for, and the technical implementation of, a content-based recommendation system.
Need for Content-based recommendations
Recommender systems can be broadly classified into content-based systems and collaborative-filtering systems. Content-based systems generate recommendations based on product attributes (or any data available in the form of objects, for example, blog posts in the case of content sites). Collaborative-filtering systems generate recommendations based on user behavior.
In most use-cases, collaborative filtering works better than content-based recommendations because it brings out more obscure user behavior patterns which are not apparent from the product data. So why do we need content-based recommendations at all? Because in a brand-new website which lacks user activity, collaborative-filtering presents a catch-22.
The catch-22 is that collaborative-filtering needs quality user activity in the site for generating recommendations, but quality recommendations are very much needed to encourage users to be active on the site.
Even if the site has sufficient user activity, it might not be evenly distributed among the products on the site. User activity usually follows an approximate 80:20 rule: 80% of the traffic is brought in by 20% of the content. Collaborative-filtering fails to generate quality recommendations for clusters of products that lack user activity.
Content-based recommendations can be used to substitute collaborative-filtering recommendations whenever enough user data is not available.
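To make the substitution concrete, here is a minimal sketch of how a site might choose between the two approaches per product. The threshold value, the function name and the product ids are all hypothetical; they only illustrate the idea and would need tuning on real traffic.

```python
# Hypothetical threshold: how many user interactions a product needs before
# collaborative-filtering is trusted for it. Not a value from the post.
MIN_INTERACTIONS = 50


def pick_recommender(interaction_counts, product_id, min_interactions=MIN_INTERACTIONS):
    """Return which recommender to use for a given product."""
    if interaction_counts.get(product_id, 0) >= min_interactions:
        return "collaborative"
    # Too little (or no) user activity: fall back to product attributes.
    return "content_based"


# Invented interaction counts; "p3" is a brand-new product with no activity.
interaction_counts = {"p1": 420, "p2": 3}
for pid in ("p1", "p2", "p3"):
    print(pid, pick_recommender(interaction_counts, pid))
```

Products in the long tail of the 80:20 distribution, and brand-new products, end up on the content-based path.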
Generating Content-based Recommendations
As discussed earlier, the content-based approach generates recommendations based on product attributes without taking user activity into account. But to compute the similarity between products, we need to convert the product attribute documents into a format that the algorithm can process: vectors, basically NumPy arrays of numbers that can be used downstream.
Let’s have a look at a sample product and its attributes.
We have many descriptive attributes which can be used to generate the product vector. A simple approach to generating the vector is to use one-hot encoding on select attributes like cloth_type etc. But by doing this we would be ignoring the other valuable attributes, and it wouldn’t work when those attributes are unavailable. Ideally, we want a method that can make use of as many attributes as possible, while keeping the load on the system reasonable.
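As a rough illustration of the simple approach, one-hot encoding a single attribute could look like the sketch below. Only the cloth_type attribute name comes from this post; the product records and their values are made up for the example.

```python
# Hypothetical products; the values are invented for illustration.
products = [
    {"name": "A", "cloth_type": "shirt"},
    {"name": "B", "cloth_type": "jeans"},
    {"name": "C", "cloth_type": "shirt"},
    {"name": "D"},  # cloth_type is missing for this product
]


def one_hot(records, attribute):
    """Return (vocabulary, vectors) for one categorical attribute."""
    vocabulary = sorted({r[attribute] for r in records if attribute in r})
    index = {value: i for i, value in enumerate(vocabulary)}
    vectors = []
    for r in records:
        vec = [0] * len(vocabulary)
        if attribute in r:
            vec[index[r[attribute]]] = 1
        vectors.append(vec)
    return vocabulary, vectors


vocabulary, vectors = one_hot(products, "cloth_type")
print(vocabulary)
print(vectors)
```

Product D ends up with an all-zero vector, which is exactly the weakness mentioned above: when the chosen attribute is unavailable, the product carries no information at all.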
Steps to convert product attributes to a vector — code is available here
- Remove non-descriptive attributes. The description attribute can be important in a few cases, but since we have many other attributes in this case, we can safely ignore it as well.
- The remaining attributes can be of numeric, string or list type (flatten the data so that it does not contain any nested objects). Each of these types should be vectorized.
- Numeric data such as retail_price etc. is already in the right format, so it is added to the vector after scaling with MinMaxScaler.
- string type attributes can be encoded using the Tf-Idf Vectorizer, where the value of the attribute in each product is treated as a document, and the vocabulary is built from the values of that attribute across all products. Care must be taken to control the vector length through the vectorizer’s parameters.
- list type attributes can be encoded into a string and vectorized the same way.
- Convert the vector matrix to sparse type, to make the algorithm memory efficient.
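The steps above can be sketched with scikit-learn and SciPy roughly as follows. This is not the code linked from this post, only a minimal sketch under a few assumptions: the tags attribute, the product values and the build_product_matrix helper are invented, missing numeric values are filled with 0.0, and max_features is used as one possible way to cap the vector length.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical catalogue: "tags" and all values are invented for this sketch.
products = [
    {"retail_price": 999.0, "cloth_type": "casual shirt", "tags": ["cotton", "slim fit"]},
    {"retail_price": 1499.0, "cloth_type": "formal shirt", "tags": ["cotton", "regular fit"]},
    {"retail_price": 2499.0, "cloth_type": "denim jeans", "tags": ["stretch", "slim fit"]},
]


def build_product_matrix(records, numeric_attrs, text_attrs, max_features=50):
    """Stack scaled numeric columns and per-attribute TF-IDF blocks into one sparse matrix."""
    blocks = []

    # Numeric attributes: scale each column to the [0, 1] range.
    # Assumption: a missing numeric value is treated as 0.0.
    numeric = np.array(
        [[r.get(a, 0.0) for a in numeric_attrs] for r in records], dtype=float
    )
    blocks.append(sparse.csr_matrix(MinMaxScaler().fit_transform(numeric)))

    # String and list attributes: one TF-IDF vocabulary per attribute.
    for attr in text_attrs:
        docs = []
        for r in records:
            value = r.get(attr, "")
            # A list is joined into a single string before vectorizing.
            docs.append(" ".join(value) if isinstance(value, list) else str(value))
        vectorizer = TfidfVectorizer(max_features=max_features)  # caps the block width
        blocks.append(vectorizer.fit_transform(docs))

    # Keep everything sparse to stay memory efficient.
    return sparse.hstack(blocks).tocsr()


matrix = build_product_matrix(products, ["retail_price"], ["cloth_type", "tags"])
print(matrix.shape)
```

Fitting a separate vectorizer per attribute keeps the vocabularies apart, so a word appearing in two different attributes does not collapse into a single column.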
Now that the documents are in vector format, we can check that the vectors have captured the structure of the data with a 3D t-SNE plot. t-SNE is a dimensionality reduction algorithm which, here, reduces an n-dimensional vector to a 3-dimensional vector that is easy to visualize. The scatter plot below shows the spread of the 3D product vectors. Each product is color coded by a combination of attributes that includes gender. As can be seen from the plot, the basic structure in the data is captured in the vectors.
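For readers who want to try the reduction step, a minimal sketch using scikit-learn's TSNE is shown below. The input is synthetic (two invented clusters standing in for product vectors), and the perplexity and init settings are illustrative choices, not the ones used for the plot in this post.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for product vectors: two synthetic clusters, invented for this sketch.
rng = np.random.default_rng(0)
vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 8)),
    rng.normal(loc=1.0, scale=0.1, size=(20, 8)),
])

# Reduce the 8-dimensional vectors to 3 dimensions.
# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=3, perplexity=10, init="random", random_state=0)
embedded = tsne.fit_transform(vectors)
print(embedded.shape)  # (40, 3)
```

The three columns of embedded can then be passed as x, y and z to any 3D scatter plot, with the color of each point chosen from the product attributes.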
Let’s use the algorithm to provide similar products when given a base product. Basically, this can be imagined as getting the nearest points when given a base point from the 3D scatter plot.
The recommendations generated are similar to the base product provided, so they can be used to power a recommendation widget on the base product’s detail page.
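The nearest-points idea can be sketched in a few lines of NumPy. Cosine similarity is assumed here as the distance measure, since the post does not name one, and the similar_products helper and the tiny vectors are hypothetical.

```python
import numpy as np


def similar_products(vectors, base_index, top_n=3):
    """Indices of the top_n products closest to the base product by cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)  # guard against zero vectors
    scores = unit @ unit[base_index]
    scores[base_index] = -np.inf  # never recommend the base product itself
    return np.argsort(-scores)[:top_n].tolist()


# Invented product vectors: products 0 and 1 are alike, as are 2 and 3.
vectors = np.array([
    [1.0, 0.0, 0.2],
    [0.9, 0.1, 0.3],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.1],
])
print(similar_products(vectors, base_index=0, top_n=2))  # [1, 3]
```

For a real catalogue the same computation would run on the sparse product matrix, where a library routine for pairwise similarity is likely a better fit than dense NumPy arrays.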
Content-based recommendation is not the go-to method for personalized recommendations, but it can still be used as a fallback for more sophisticated recommendation engines. When given a list of products liked by the user, the same similar-products logic can be extended to generate personalized recommendations.
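One possible way to extend the similar-products logic to a user is to average the vectors of the products they liked into a single profile vector and look for its nearest neighbours. That averaging step is an assumption made for this sketch, not something the post prescribes, and the recommend_for_user helper and the data are hypothetical.

```python
import numpy as np


def recommend_for_user(vectors, liked, top_n=2):
    """Rank products by cosine similarity to the mean vector of the liked products."""
    profile = vectors[liked].mean(axis=0)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(profile)
    scores = (vectors @ profile) / np.where(norms == 0, 1.0, norms)
    scores[liked] = -np.inf  # skip products the user already liked
    return np.argsort(-scores)[:top_n].tolist()


# Invented product vectors; the user liked products 0 and 1.
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.1, 0.9],
])
print(recommend_for_user(vectors, liked=[0, 1], top_n=2))  # [2, 4]
```

Product 2 ranks first because it points in nearly the same direction as the two liked products.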
That was it, hope you had fun! Please share your comments and thoughts below. I’ll be happy to respond.
The article was originally published on Medium