Recommender Systems for Modeling Feature Interactions under Sparse Settings
Many machine learning domains, such as recommender systems, targeted advertisement, search ranking, and text analysis contain highly sparse data because of the large categorical variable domains. This sparsity makes it hard for ML algorithms to model secondorder and above feature interactions. In this article, I summarize the need for modeling feature interactions and introduce some of the most popular ML architectures designed for estimating interactions from sparse data.
Introduction
Problem with sparse inputs
A variety of Information Retrieval and Data Mining tasks, such as Recommender Systems, Targeted Advertising, Search Ranking, etc. model discrete and categorical variables extensively. As an example, these variables could be user IDs, product IDs, advertisement IDs, and user demographics such as gender and occupation. A common technique to use these categorical variables in machine learning algorithms is to convert them to a set of binary vectors via onehot encoding. This means that for a system with a large userbase and catalog the resultant feature vector is highly sparse. As explained next, this highly sparse setting makes it difficult to incorporate feature interactions.
Why modeling feature interaction is important?
As an example, consider an artificial Clickthrough rate (CTR) dataset shown in the table below, where + and  represent the number of clicked and unclicked impressions respectively.
We can see that an ad from Gucci has a high CTR on Vogue. It is difficult for linear models to learn this information because they learn the two weights Gucci and Vogue separately. To address this problem, a machine learning model will have to learn the effect of their feature interaction. Algorithms such as Poly2 (degree2 polynomial) do this by learning a dedicated weight for each feature conjunction ($O(n^{2})$ time complexity).
Note: An order2 interaction can be between two features like app category and timestamp. For example, people often download Uber apps for food delivery at mealtime. Similarly, an order3 interaction can be between app category, user gender, and age. For example, a report showed that male teenagers like shooting and RPG games. Research shows that considering low and highorder feature interactions simultaneously brings additional improvement over the cases of considering either alone^{1}. For a highly sparse dataset, these feature interactions are often hidden and difficult to identify a priori.
Memorization and Generalization paradigms
One common challenge in building such applications is to achieve both memorization and generalization.
 Memorization is about learning the frequent cooccurrence of features and exploiting the correlations observed from the historical data. Google Play store recommender system might look into two features ‘installed_app’ and ‘impression_app’ to calculate the probability of a user installing the impression_app. Memorizationbased algorithms use crossproduct transformation and look at the features, such as “AND(previously_installed_app=netflix, current_impression_app=spotify)”, whose value is 1 and correlate it with the target label. While such recommendations are topical and directly relevant based on the user’s previous actions, they can not generalize when crossfeature interaction never happened in the training data. For example, there is no training data for the pair (NBC, Gucci) in the table above. Creating crossproduct transformations may also require a lot of manual feature engineering effort.
 Generalization is based on the transitivity of correlation and exploring new feature combinations that have never or rarely occurred in the past. To achieve this generalization we use embeddingsbased methods, such as factorization machines or deep neural networks by learning a lowdimensional dense embedding vector. While such recommenders tend to improve the diversity of the recommended items, they also suffer from the problem of overgeneralizing. These methods have to learn effective lowdimensional representations from sparse and highrank data. For cases such as users with specific preferences, or niche items with a narrow appeal, dense embeddings lead to nonzero predictions and thus make less relevant recommendations. Comparatively, it is easier for generalized linear models with crossproduct feature transformations to memorize these “exception rules” with much fewer parameters.
In this article, I will introduce, implement and compare seven popular frameworks to achieve both memorization and generalization in individual model architectures.
Dataset Preprocessing
For this article, I will be using the MovieLens 100K dataset^{2} from Kaggle. It contains 100K ratings from 943 users on 1682 movies, along with demographic information for the user. Each user has rated at least 20 movies. The following transformations were applied to prepare the dataset for experimentation.
 Dataset was sorted by user_id and timestamp to create time ordering for each user.
 A target column was created which is timewise the next movie that the user will watch.
 New columns such as average movie rating per user, average movie rating per genre per user, number of movies watched per user in each genre normalized by total movies watched by that user, etc. were created.
 Gender column was label encoded, and the occupation column was dummy encoded.
 Continuous features were scaled as appropriate.
 A sparse binary tensor was created indicating movies that the user has watched previously.
 Dataset was split into traintest with an 8020 ratio.
The following diagram shows the various features generated through preprocessing. Some or all of these features are used during experimentation depending upon the specific model architecture.
Refer to the respective notebook for data preprocessing steps specific to each model.
Experimentation
Each of the model implementations is done in its standalone Jupyter notebook file and is linked in the corresponding section below. Every notebook contains code for data preprocessing, model definition, training, and evaluation as appropriate to the respective model.
1. Factorization Machine (FM)
Factorization Machines (FMs) were originally purposed as a supervised machine learning method for collaborative recommendations in 2010. FMs allow parameter estimation under very sparse data where its predecessors, like SVMs failed, and unlike Matrix Factorization it isn’t limited to modeling the relation of two entities only. FMs learn secondorder feature interactions on top of a linear model by modeling all interactions between each pair of features. They can also be optimized to work in linear time complexity. While in principle, FM can model highorder feature interaction, in practice usually only order2 features are considered due to high complexity.
where $w_{0}$ is the global bias, $w_{i}$ denotes the weight of the ith feature, and $ŵ_{ij}$ denotes the weight of the cross feature $x_{i}x_{j}$, which is factorized as: $ŵ_{ij} = v_{i}^{T} v_{j}$, where $v_{i} \in R^{k}$ denotes the embedding vector for feature i, and k denotes the size of embedding vector. Note that due to the coefficient $x_{i}x_{j}$, only interactions between nonzero features are considered.
The example above shows a sparse input feature vector x with corresponding target y. We have binary indicator variables for the user, movie, and the last movie rated, along with a normalized rating for other movies and timestamps. Let’s say we want to estimate the interaction between the user Alice (A) and the movie Star Trek (ST). You will notice that in x we do not have any example in x with an interaction between A and ST. FM can still estimate this by using the factorized interaction parameters $\langle v_{A}, v_{ST} \rangle$ even in this case.
Implementation
Suppose we have M training objects, n features and we want to factorize feature interaction with vectors of size k i.e. dimensionality of $v_{i}$. Let us denote our trainset as $X \in R^{M×n}$, and matrix of $v_{i}$ (the ith row is $v_{i}$) as $V \in R^{n×k}$. Also, let’s denote the feature vector for the jth object as $x_{j}$. So:
The number in brackets indicates the index of the sample for x and the index of the feature for v. Also, the last term in the FM equation can be expressed as:
$S_{1}$ is a dot product of feature vector $x_{j}$ and ith column of V. If we multiply X and V, we get:
So if square XV elementwise and then find the sum of each row, we obtain a vector of $S_{1}^{2}$ terms for each training sample. Also, if we first square X and V elementwise, then multiply them, and finally sum by rows, we’ll get $S_{2}$ term for each training object. So, conceptually, we can express the final term like this:
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/2a639276fc781870c4dcd480a3417bf9
You can read more about FMs in the original paper and the official codebase.
2. FieldAware Factorization Machine (FFM)
Let’s revisit the artificial dataset example from earlier and add another column Gender to it. One observation of such a dataset might look like this:
FMs will model the effect of feature conjunction as: $w_{ESPN} \cdot w_{Nike} + w_{ESPN} \cdot w_{Male} + w_{Nike} \cdot w_{Male}$. Fieldaware factorization machines (FFMs) extend FMs by introducing the concept of fields and features. With the same example, Publisher, Advertiser, and Gender will be called fields, while values like ESPN, Nike, and Male will be called their features. Note that in FMs every feature has only one latent vector to learn the latent effect with any other features. For example, for ESPN $w_{ESPN}$ is used to learn the latent effect with Nike ($w_{ESPN} \cdot w_{Nike}$) and Male ($w_{ESPN} \cdot w_{Male}$). However, because Nike and Male belong to different fields, the latent effects of (EPSN, Nike) and (EPSN, Male) may be different.
In FFMs, each feature has several latent vectors. Depending on the field of other features, one of them is used to do the inner product. So, for the same example, the feature interaction effect is modelled by FFM as: $w_{ESPN,A} \cdot w_{Nike,P} + w_{ESPN,G} \cdot w_{Male,P} + w_{Nike,G} \cdot w_{Male,A}$. To learn the latent effect of (ESPN, NIKE), for example, $w_{ESPN,A}$ is used because Nike belongs to the field Advertiser, and $w_{Nike,P}$ is used because ESPN belongs to the field Publisher.
If f is the number of fields, then the number of variables of FFMs is nfk, and the time complexity is $O(\bar{n}^{2}k)$. Note that because each latent vector in FFMs only needs to learn the effect with a specific field, usually: $k_{FFM} \ll k_{FM}$. FFM authors empirically show that for large, sparse datasets with many categorical features, FFMs perform better than FMs. Whereas for small and dense datasets or numerical datasets, FMs perform better than FFMs.
Implementation
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/c6a8037f4596a8181376313fdba33ffd
You can read more about FFMs in the original paper and the official codebase.
3. Attentional Factorization Machine (AFM)
Note how all interactions in FM are weighted with interaction weight matrix W. This W is a positive definite matrix and it can be shown that given a sufficiently large value of k, there will always be a matrix V such that $W=V\cdot V^{T}$. In sparse settings, we usually restrict k to a smaller value which also leads to better generalizability. However, despite its effectiveness, FMs model all feature interactions with the same weight, but not all feature interactions are equally useful and predictive. For example, the interactions with useless features may even introduce noise leading to degraded model performance. Attentional Factorization Machines (AFMs) fix this by learning the importance of each feature interaction from data via a neural attention network.
AFM starts with sparse input and embedding layer, and inspired by FM’s inner product, it expands m vectors to m(m1)/2 interacted vectors, where each interacted vector is the elementwise product of two distinct vectors to encode their interaction. The output of the pairwise interaction layer is fed into an attentionbased pooling layer, the idea here is to allow different parts to contribute differently when compressing them to a single representation. An attention mechanism is applied to feature interactions by performing a weighted sum on the interacted vectors. This weight or attention score can be thought of as the importance of weight $\hat{w}_{ij}$ in predicting the target. However, we have a problem when the features have never cooccurred in the training data. To address this problem, the authors further parameterize attention scores with a multilayer perceptron (MLP). The attention scores are also normalized through the softmax function. One shortcoming of AFM architecture is that it models only secondorder feature interactions.
Implementation
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/85fe046691c66221bec00bc7e59e145b
You can read more about AFMs in the original paper and the official codebase.
4. Wide & Deep Learning
The Wide & Deep learning framework was proposed by Google to achieve both memorization and generalization in one model, by jointly training a linear model component and a neural network component. The wide component is a generalized linear model to which raw input features and transformed features (such as crossproduct transformations) are supplied. The deep component is a feedforward neural network, which consumes sparse categorical features in embedding vector form. The wide and deep part are combined using a weighted sum of their output log odds as the prediction which is then fed to a common loss function for joint training.
The authors claim that wide linear models can effectively memorize sparse feature interactions using crossproduct feature transformations, while deep neural networks can generalize to previously unseen feature interactions through lowdimensional embeddings. Note that the wide and deep parts work with two different inputs and the input to the wide part still relies on expertise feature engineering. The model also suffers from the same problem as the linear models like Polynomial Regression that features interactions can not be learned for unobserved cross features.
Implementation
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/a6ab97ed6bc053aaf3d73320b4b31b97
You can read more about Wide&Deep in the original paper and the official codebase.
5. DeepFM
DeepFM model architecture combines the power of FMs and deep learning to overcome the issues with Wide&Deep networks. DeepFM uses a single shared input to its wide and deep parts, with no need of any special feature engineering besides raw features. It models loworder feature interactions like FM and models highorder feature interactions like DNN.
Implementation
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/79237a4de62b4033e2576c55df3dc056
You can read more about DeepFM in the original paper.
6. Neural Factorization Machine (NFM)
Neural Factorization Machine (NFM) model architecture was also proposed by the authors of the AFM paper with the same goal of overcoming the insufficient linear modeling of feature interactions in FM. After the sparse input and embedding layer, this time the authors propose a BiInteraction layer that models the secondorder feature interactions. This layer is a pooling layer that converts a set of the embedding vectors set input to one vector by performing an elementwise product of vectors: $\sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{i}v_{i} \odot x_{j}v_{j}$, and passes it on to another set of fully connected layers.
Implementation
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/f29d7001b7687785f33636c9bca302c3
You can read more about NFM in the original paper and the official codebase.
7. Deep Learning Recommendation Model (DLRM)
Deep Learning Recommendation Model (DLRM) was proposed by Facebook in 2019. The DLRM architecture can be thought of as a simplified version of DeepFM architecture. DLRM tries to stay away from highorder interactions by not passing the embedded categorical features through an MLP layer. First, it makes the continuous features go through a “bottom” MLP layer such that they have the same length as the embedding vectors. Then, it computes the dot product between all combinations of embedding vectors and the bottom MLP output from the previous step. The dot product output is then concatenated with the bottom MLP output and is passed to a “top” MLP layer to compute the final output.
This architecture design is tailored to mimic the way Factorization Machines compute the secondorder interactions between the embeddings, and the paper also says:
We argue that higherorder interactions beyond secondorder found in other networks may not necessarily be worth the additional computational/memory cost.
Implementation
The paper also notes that DLRM contains far more parameters than common models like CNNs, RNNs, GANs, and transformers, making the training time for this model go up to several weeks. They also propose a framework to parallelize DLRM operations. Due to high compute requirements, I couldn’t train the DLRM model myself. DLRM results in this experimentation are with a simplified implementation using a concatenation of bottom MLP output and embedding output, instead of their dot product.
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/a02a83fbb3ae5e293fde4b90e3a319d7
You can read more about DLRM in the original paper and the official codebase.
Comparing all models
The mean rank of the test set examples was computed using each model, and the following chart shows their comparison based on the best rank achieved on the test set. In short, for this experiment: AFM > FFM » DeepFM > DLRM_simplified > Wide & Deep > NFM > FM.
Summary
In this article, we defined the need for modeling feature interactions and then looked at some of the most popular machine learning algorithms designed to estimate feature interactions under sparse settings. We implemented all of the algorithms in Python and compared their results on a toy dataset.
References

Cheng et al. (2016). Wide & Deep Learning for Recommender Systems. 710. 10.1145/2988450.2988454. ↩︎

https://www.kaggle.com/datasets/prajitdatta/movielens100kdataset ↩︎