Contents

Recommender Systems for Modeling Feature Interactions under Sparse Settings

Many machine learning domains, such as recommender systems, targeted advertisement, search ranking, and text analysis contain highly sparse data because of the large categorical variable domains. This sparsity makes it hard for ML algorithms to model second-order and above feature interactions. In this article, I summarize the need for modeling feature interactions and introduce some of the most popular ML architectures designed for estimating interactions from sparse data.

Introduction

Problem with sparse inputs

A variety of Information Retrieval and Data Mining tasks, such as Recommender Systems, Targeted Advertising, Search Ranking, etc. model discrete and categorical variables extensively. As an example, these variables could be user IDs, product IDs, advertisement IDs, and user demographics such as gender and occupation. A common technique to use these categorical variables in machine learning algorithms is to convert them to a set of binary vectors via one-hot encoding. This means that for a system with a large userbase and catalog the resultant feature vector is highly sparse. As explained next, this highly sparse setting makes it difficult to incorporate feature interactions.

Why modeling feature interaction is important?

As an example, consider an artificial Click-through rate (CTR) dataset shown in the table below, where + and - represent the number of clicked and unclicked impressions respectively.

Using Factor Matrices

We can see that an ad from Gucci has a high CTR on Vogue. It is difficult for linear models to learn this information because they learn the two weights Gucci and Vogue separately. To address this problem, a machine learning model will have to learn the effect of their feature interaction. Algorithms such as Poly2 (degree-2 polynomial) do this by learning a dedicated weight for each feature conjunction ($O(n^{2})$ time complexity).

Note: An order-2 interaction can be between two features like app category and timestamp. For example, people often download Uber apps for food delivery at meal-time. Similarly, an order-3 interaction can be between app category, user gender, and age. For example, a report showed that male teenagers like shooting and RPG games. Research shows that considering low- and high-order feature interactions simultaneously brings additional improvement over the cases of considering either alone1. For a highly sparse dataset, these feature interactions are often hidden and difficult to identify a priori.

Memorization and Generalization paradigms

One common challenge in building such applications is to achieve both memorization and generalization.

  • Memorization is about learning the frequent co-occurrence of features and exploiting the correlations observed from the historical data. Google Play store recommender system might look into two features ‘installed_app’ and ‘impression_app’ to calculate the probability of a user installing the impression_app. Memorization-based algorithms use cross-product transformation and look at the features, such as “AND(previously_installed_app=netflix, current_impression_app=spotify)”, whose value is 1 and correlate it with the target label. While such recommendations are topical and directly relevant based on the user’s previous actions, they can not generalize when cross-feature interaction never happened in the training data. For example, there is no training data for the pair (NBC, Gucci) in the table above. Creating cross-product transformations may also require a lot of manual feature engineering effort.
  • Generalization is based on the transitivity of correlation and exploring new feature combinations that have never or rarely occurred in the past. To achieve this generalization we use embeddings-based methods, such as factorization machines or deep neural networks by learning a low-dimensional dense embedding vector. While such recommenders tend to improve the diversity of the recommended items, they also suffer from the problem of over-generalizing. These methods have to learn effective low-dimensional representations from sparse and high-rank data. For cases such as users with specific preferences, or niche items with a narrow appeal, dense embeddings lead to nonzero predictions and thus make less relevant recommendations. Comparatively, it is easier for generalized linear models with cross-product feature transformations to memorize these “exception rules” with much fewer parameters.

In this article, I will introduce, implement and compare seven popular frameworks to achieve both memorization and generalization in individual model architectures.

Dataset Preprocessing

For this article, I will be using the MovieLens 100K dataset2 from Kaggle. It contains 100K ratings from 943 users on 1682 movies, along with demographic information for the user. Each user has rated at least 20 movies. The following transformations were applied to prepare the dataset for experimentation.

  • Dataset was sorted by user_id and timestamp to create time ordering for each user.
  • A target column was created which is time-wise the next movie that the user will watch.
  • New columns such as average movie rating per user, average movie rating per genre per user, number of movies watched per user in each genre normalized by total movies watched by that user, etc. were created.
  • Gender column was label encoded, and the occupation column was dummy encoded.
  • Continuous features were scaled as appropriate.
  • A sparse binary tensor was created indicating movies that the user has watched previously.
  • Dataset was split into train-test with an 80-20 ratio.

The following diagram shows the various features generated through preprocessing. Some or all of these features are used during experimentation depending upon the specific model architecture.

input features

Refer to the respective notebook for data preprocessing steps specific to each model.

Experimentation

Each of the model implementations is done in its standalone Jupyter notebook file and is linked in the corresponding section below. Every notebook contains code for data preprocessing, model definition, training, and evaluation as appropriate to the respective model.

1. Factorization Machine (FM)

Factorization Machines (FMs) were originally purposed as a supervised machine learning method for collaborative recommendations in 2010. FMs allow parameter estimation under very sparse data where its predecessors, like SVMs failed, and unlike Matrix Factorization it isn’t limited to modeling the relation of two entities only. FMs learn second-order feature interactions on top of a linear model by modeling all interactions between each pair of features. They can also be optimized to work in linear time complexity. While in principle, FM can model high-order feature interaction, in practice usually only order-2 features are considered due to high complexity.

FM equation 1

where $w_{0}$ is the global bias, $w_{i}$ denotes the weight of the i-th feature, and $ŵ_{ij}$ denotes the weight of the cross feature $x_{i}x_{j}$, which is factorized as: $ŵ_{ij} = v_{i}^{T} v_{j}$, where $v_{i} \in R^{k}$ denotes the embedding vector for feature i, and k denotes the size of embedding vector. Note that due to the coefficient $x_{i}x_{j}$, only interactions between non-zero features are considered.

FM example

The example above shows a sparse input feature vector x with corresponding target y. We have binary indicator variables for the user, movie, and the last movie rated, along with a normalized rating for other movies and timestamps. Let’s say we want to estimate the interaction between the user Alice (A) and the movie Star Trek (ST). You will notice that in x we do not have any example in x with an interaction between A and ST. FM can still estimate this by using the factorized interaction parameters $\langle v_{A}, v_{ST} \rangle$ even in this case.

Implementation

Suppose we have M training objects, n features and we want to factorize feature interaction with vectors of size k i.e. dimensionality of $v_{i}$. Let us denote our trainset as $X \in R^{M×n}$, and matrix of $v_{i}$ (the ith row is $v_{i}$) as $V \in R^{n×k}$. Also, let’s denote the feature vector for the jth object as $x_{j}$. So:

equation 3

The number in brackets indicates the index of the sample for x and the index of the feature for v. Also, the last term in the FM equation can be expressed as:

equation 2

$S_{1}$ is a dot product of feature vector $x_{j}$ and ith column of V. If we multiply X and V, we get:

equation 4

So if square XV element-wise and then find the sum of each row, we obtain a vector of $S_{1}^{2}$ terms for each training sample. Also, if we first square X and V element-wise, then multiply them, and finally sum by rows, we’ll get $S_{2}$ term for each training object. So, conceptually, we can express the final term like this:

equation 6

PyTorch Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
class FM(nn.Module):
    def __init__(self, input_dim, n_class, k=5):
        super().__init__()
        self.V = nn.Parameter(torch.randn(input_dim, k),requires_grad=True)
        self.linear_layer = nn.Linear(input_dim, n_class, device=device)
        
    def forward(self, x):
        square_of_sum = torch.matmul(x, self.V.to(device)).pow(2).sum(1, keepdim=True) #S_1^2
        sum_of_square = torch.matmul(x.pow(2), self.V.to(device).pow(2)).sum(1, keepdim=True) # S_2
        
        out_inter = 0.5 * (square_of_sum - sum_of_square)
        out_lin = self.linear_layer(x)
        out = out_inter + out_lin
        return out

Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/2a639276fc781870c4dcd480a3417bf9

You can read more about FMs in the original paper and the official codebase.

2. Field-Aware Factorization Machine (FFM)

Let’s revisit the artificial dataset example from earlier and add another column Gender to it. One observation of such a dataset might look like this:

FFM example

FMs will model the effect of feature conjunction as: $w_{ESPN} \cdot w_{Nike} + w_{ESPN} \cdot w_{Male} + w_{Nike} \cdot w_{Male}$. Field-aware factorization machines (FFMs) extend FMs by introducing the concept of fields and features. With the same example, Publisher, Advertiser, and Gender will be called fields, while values like ESPN, Nike, and Male will be called their features. Note that in FMs every feature has only one latent vector to learn the latent effect with any other features. For example, for ESPN $w_{ESPN}$ is used to learn the latent effect with Nike ($w_{ESPN} \cdot w_{Nike}$) and Male ($w_{ESPN} \cdot w_{Male}$). However, because Nike and Male belong to different fields, the latent effects of (EPSN, Nike) and (EPSN, Male) may be different.

In FFMs, each feature has several latent vectors. Depending on the field of other features, one of them is used to do the inner product. So, for the same example, the feature interaction effect is modelled by FFM as: $w_{ESPN,A} \cdot w_{Nike,P} + w_{ESPN,G} \cdot w_{Male,P} + w_{Nike,G} \cdot w_{Male,A}$. To learn the latent effect of (ESPN, NIKE), for example, $w_{ESPN,A}$ is used because Nike belongs to the field Advertiser, and $w_{Nike,P}$ is used because ESPN belongs to the field Publisher.

If f is the number of fields, then the number of variables of FFMs is nfk, and the time complexity is $O(\bar{n}^{2}k)$. Note that because each latent vector in FFMs only needs to learn the effect with a specific field, usually: $k_{FFM} \ll k_{FM}$. FFM authors empirically show that for large, sparse datasets with many categorical features, FFMs perform better than FMs. Whereas for small and dense datasets or numerical datasets, FMs perform better than FFMs.

Implementation

PyTorch Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
class FFM(nn.Module):
    def __init__(self, continuous_dim, field_dims, n_class, embed_dim=16, pad_idx=0):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros((n_class,)))
        self.embeddings = nn.Embedding(sum(field_dims), n_class, padding_idx=pad_idx, device=device)
        
        self.num_fields = len(field_dims)
        self.embeddings_interaction = nn.ModuleList([
            nn.Embedding(sum(field_dims), embed_dim, padding_idx=pad_idx, device=device) for _ in range(self.num_fields)
        ])
        
        self.linear_layer = nn.Linear(continuous_dim, n_class, device=device)

    def forward(self, continuous_X, categorical_X):
        embeds_out = torch.sum(self.embeddings(categorical_X), dim=1) + self.bias.to(device)
        
        field_wise_emb_list = [self.embeddings_interaction[i](categorical_X) for i in range(self.num_fields)]
        ix = list()
        for i in range(self.num_fields - 1):
            for j in range(i + 1, self.num_fields):
                ix.append(field_wise_emb_list[j][:, i] * field_wise_emb_list[i][:, j])
        ix = torch.stack(ix, dim=1)
        ffm_interaction_term = torch.sum(torch.sum(ix, dim=1), dim=1, keepdim=True)
        output = self.linear_layer(continuous_X) + embeds_out + ffm_interaction_term
        return output

Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/c6a8037f4596a8181376313fdba33ffd

You can read more about FFMs in the original paper and the official codebase.

3. Attentional Factorization Machine (AFM)

Note how all interactions in FM are weighted with interaction weight matrix W. This W is a positive definite matrix and it can be shown that given a sufficiently large value of k, there will always be a matrix V such that $W=V\cdot V^{T}$. In sparse settings, we usually restrict k to a smaller value which also leads to better generalizability. However, despite its effectiveness, FMs model all feature interactions with the same weight, but not all feature interactions are equally useful and predictive. For example, the interactions with useless features may even introduce noise leading to degraded model performance. Attentional Factorization Machines (AFMs) fix this by learning the importance of each feature interaction from data via a neural attention network.

AFM architecture

AFM starts with sparse input and embedding layer, and inspired by FM’s inner product, it expands m vectors to m(m-1)/2 interacted vectors, where each interacted vector is the element-wise product of two distinct vectors to encode their interaction. The output of the pair-wise interaction layer is fed into an attention-based pooling layer, the idea here is to allow different parts to contribute differently when compressing them to a single representation. An attention mechanism is applied to feature interactions by performing a weighted sum on the interacted vectors. This weight or attention score can be thought of as the importance of weight $\hat{w}_{ij}$ in predicting the target. However, we have a problem when the features have never co-occurred in the training data. To address this problem, the authors further parameterize attention scores with a multi-layer perceptron (MLP). The attention scores are also normalized through the softmax function. One shortcoming of AFM architecture is that it models only second-order feature interactions.

Implementation

PyTorch Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
class AFM(nn.Module):
    def __init__(self, continuous_dim, field_dims, n_class, attn_size=16, embed_size=16, pad_idx=0):
        super().__init__()
        self.embeddings = nn.Embedding(sum(field_dims), embed_size, padding_idx=pad_idx, device=device)
        self.attention = nn.Linear(embed_size, attn_size, device=device)
        self.projection = nn.Linear(attn_size, 1, device=device)
        self.fc = nn.Linear(embed_size, n_class, device=device)
        self.linear_layer = nn.Linear(continuous_dim, n_class, device=device)
    
    def forward(self, continuous_X, categorical_X):
        embeds_out = self.embeddings(categorical_X)
        num_fields = embeds_out.shape[1]
        row, col = list(), list()
        for i in range(num_fields - 1):
            for j in range(i + 1, num_fields):
                row.append(i), col.append(j)
        p, q = embeds_out[:, row], embeds_out[:, col]
        inner_product = p * q
        attn_scores = nn.functional.relu(self.attention(inner_product))
        attn_scores = nn.functional.softmax(self.projection(attn_scores), dim=1)
        attn_output = torch.sum(attn_scores * inner_product, dim=1)
        output = self.linear_layer(continuous_X) + self.fc(attn_output)
        return output

Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/85fe046691c66221bec00bc7e59e145b

You can read more about AFMs in the original paper and the official codebase.

4. Wide & Deep Learning

The Wide & Deep learning framework was proposed by Google to achieve both memorization and generalization in one model, by jointly training a linear model component and a neural network component. The wide component is a generalized linear model to which raw input features and transformed features (such as cross-product transformations) are supplied. The deep component is a feed-forward neural network, which consumes sparse categorical features in embedding vector form. The wide and deep part are combined using a weighted sum of their output log odds as the prediction which is then fed to a common loss function for joint training.

W&D architecture

The authors claim that wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings. Note that the wide and deep parts work with two different inputs and the input to the wide part still relies on expertise feature engineering. The model also suffers from the same problem as the linear models like Polynomial Regression that features interactions can not be learned for unobserved cross features.

Implementation

PyTorch Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
class WideDeep(nn.Module):
    def __init__(self, wide_dim, n_class, deep_dim, embed_dim, embed_size, pad_idx=0):
        super(WideDeep, self).__init__()
        self.embed = nn.Embedding(embed_dim, embed_size, padding_idx=pad_idx, device=device)
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(embed_size, 1024, device=device),
            nn.ReLU(),
            nn.Linear(1024, 512, device=device),
            nn.ReLU(),
            nn.Linear(512, 256, device=device),
            nn.ReLU()
        )
        self.output = nn.Linear(256+wide_dim, n_class, device=device)
        
    def forward(self, X_w, X_sparse_idx):
        embed = self.embed(X_sparse_idx)
        embed = torch.mean(embed, dim=1)
        deep_logits = self.linear_relu_stack(embed)
        total_logits = self.output(torch.cat((deep_logits, X_w), dim=1))
        return total_logits

Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/a6ab97ed6bc053aaf3d73320b4b31b97

You can read more about Wide&Deep in the original paper and the official codebase.

5. DeepFM

DeepFM model architecture combines the power of FMs and deep learning to overcome the issues with Wide&Deep networks. DeepFM uses a single shared input to its wide and deep parts, with no need of any special feature engineering besides raw features. It models low-order feature interactions like FM and models high-order feature interactions like DNN.

DeepFM architecture

Implementation

PyTorch Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
class DeepFM(nn.Module):
    def __init__(self, embed_dim, embed_size, wide_dim, deep_dim, n_class, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(embed_dim, embed_size, padding_idx=pad_idx, device=device)
        self.linear_layer = nn.Linear(wide_dim, n_class, device=device)
        
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(embed_size + deep_dim, 1024, device=device),
            nn.ReLU(),
            nn.Linear(1024, 512, device=device),
            nn.ReLU(),
            nn.Linear(512, 256, device=device),
            nn.ReLU()
        )
        self.output = nn.Linear(256+1, n_class, device=device)

    def forward(self, X_w, X_d, X_sparse_idx):
        embed_x = self.embedding(X_sparse_idx)
        embed_x = torch.mean(embed_x, dim=1)
        
        # FM
        square_of_sum = torch.sum(embed_x, dim=1) ** 2
        sum_of_square = torch.sum(embed_x ** 2, dim=1)
        out_inter = 0.5 * (square_of_sum - sum_of_square)
        # Linear
        out_lin = self.linear_layer(X_w)
        # Deep
        out_deep = self.linear_relu_stack(torch.cat((X_d, embed_x), dim=1))
        
        output = self.output(torch.cat((out_inter.unsqueeze(1), out_deep), dim=1)) + out_lin
        return output

Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/79237a4de62b4033e2576c55df3dc056

You can read more about DeepFM in the original paper.

6. Neural Factorization Machine (NFM)

Neural Factorization Machine (NFM) model architecture was also proposed by the authors of the AFM paper with the same goal of overcoming the insufficient linear modeling of feature interactions in FM. After the sparse input and embedding layer, this time the authors propose a Bi-Interaction layer that models the second-order feature interactions. This layer is a pooling layer that converts a set of the embedding vectors set input to one vector by performing an element-wise product of vectors: $\sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{i}v_{i} \odot x_{j}v_{j}$, and passes it on to another set of fully connected layers.

NFM architecture

Implementation

PyTorch Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
class NFM(nn.Module):
    def __init__(self, embed_dim, embed_size, wide_dim, deep_dim, n_class, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(embed_dim, embed_size, padding_idx=pad_idx, device=device)
        self.linear_layer = nn.Linear(wide_dim, n_class, device=device)
        
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(deep_dim + 1, 1024, device=device),
            nn.ReLU(),
            nn.Linear(1024, 512, device=device),
            nn.ReLU(),
            nn.Linear(512, n_class, device=device),
            nn.ReLU()
        )

    def forward(self, X_w, X_d, X_sparse_idx):
        embed_x = self.embedding(X_sparse_idx) # movies_train_idx
        embed_x = torch.mean(embed_x, dim=1)
        
        # FM
        square_of_sum = torch.sum(embed_x, dim=1) ** 2
        sum_of_square = torch.sum(embed_x ** 2, dim=1)
        out_inter = 0.5 * (square_of_sum - sum_of_square)
        # Linear
        out_lin = self.linear_layer(X_w)
        # Deep
        out_deep = self.linear_relu_stack(torch.cat((X_d, out_inter.unsqueeze(1)), dim=1))
        
        output = out_deep + out_lin
        return output

Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/f29d7001b7687785f33636c9bca302c3

You can read more about NFM in the original paper and the official codebase.

7. Deep Learning Recommendation Model (DLRM)

Deep Learning Recommendation Model (DLRM) was proposed by Facebook in 2019. The DLRM architecture can be thought of as a simplified version of DeepFM architecture. DLRM tries to stay away from high-order interactions by not passing the embedded categorical features through an MLP layer. First, it makes the continuous features go through a “bottom” MLP layer such that they have the same length as the embedding vectors. Then, it computes the dot product between all combinations of embedding vectors and the bottom MLP output from the previous step. The dot product output is then concatenated with the bottom MLP output and is passed to a “top” MLP layer to compute the final output.

DLRM architecture

This architecture design is tailored to mimic the way Factorization Machines compute the second-order interactions between the embeddings, and the paper also says:

We argue that higher-order interactions beyond second-order found in other networks may not necessarily be worth the additional computational/memory cost.

Implementation

The paper also notes that DLRM contains far more parameters than common models like CNNs, RNNs, GANs, and transformers, making the training time for this model go up to several weeks. They also propose a framework to parallelize DLRM operations. Due to high compute requirements, I couldn’t train the DLRM model myself. DLRM results in this experimentation are with a simplified implementation using a concatenation of bottom MLP output and embedding output, instead of their dot product.

PyTorch Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
class DLRM(nn.Module):
    def __init__(self, embed_dim, embed_size, deep_dim, n_fields, n_class, pad_idx=0, interaction_op="cat"):
        super().__init__()
        self.embedding = nn.Embedding(embed_dim, embed_size, padding_idx=pad_idx, device=device)
        
        self.bottom_mlp_stack = nn.Sequential(
            nn.Linear(deep_dim, 1024, device=device),
            nn.ReLU(),
            nn.Linear(1024, 512, device=device),
            nn.ReLU(),
            nn.Linear(512, embed_size, device=device),
            nn.ReLU()
        )
        self.interaction_op = interaction_op # ["dot", "cat"]
        if self.interaction_op == "dot":
            p, q = zip(*list(combinations(range(n_fields), 2)))
            self.field_p = nn.Parameter(torch.LongTensor(p), requires_grad=False)
            self.field_q = nn.Parameter(torch.LongTensor(q), requires_grad=False)
            self.interaction_units = int(n_fields * (n_fields - 1) / 2)
            self.upper_triange_mask = nn.Parameter(torch.triu(torch.ones(n_fields, n_fields), 1).type(torch.ByteTensor),
                                                   requires_grad=False)
            # torchrec style implementation (as an alterante to above)
            # self.triu_indices: torch.Tensor = torch.triu_indices(
            #     self.n_fields + 1, self.n_fields + 1, offset=1
            # )
            self.top_input_dim = (n_fields * (n_fields - 1)) // 2 + embed_size
        elif self.interaction_op == "cat":
            self.top_input_dim = (n_fields+1) * embed_size
        
        self.top_mlp_stack = nn.Sequential(
            nn.Linear(self.top_input_dim, 1024, device=device),
            nn.ReLU(),
            nn.Linear(1024, 512, device=device),
            nn.ReLU(),
            nn.Linear(512, n_class, device=device),
            nn.ReLU()
        )

    def forward(self, X_d, X_sparse_idx):
        embed_x = self.embedding(X_sparse_idx) # movies_train_idx
        
        # bottom mlp
        dense_out = self.bottom_mlp_stack(X_d).unsqueeze(1)
        feat_emb = torch.cat([embed_x, dense_out], dim=1)
        
        # interaction
        if self.interaction_op == "dot":
            inner_product_matrix = torch.bmm(feat_emb, feat_emb.transpose(1, 2))
            flat_upper_triange = torch.masked_select(inner_product_matrix, self.upper_triange_mask)
            interact_out = flat_upper_triange.view(-1, self.interaction_units)
            # torchrec style implementation (as an alterante to above)
            # interactions = torch.bmm(
            #     feat_emb, torch.transpose(feat_emb, 1, 2)
            # )
            # interactions_flat = interactions[:, self.triu_indices[0], self.triu_indices[1]]
            # interact_out = torch.cat((dense_out, interactions_flat), dim=1)
        else:
            interact_out = feat_emb.flatten(start_dim=1) # torch.Size([76228, 11792]) 737*16
        
        # top mlp
        output = self.top_mlp_stack(interact_out)
        return output

Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/a02a83fbb3ae5e293fde4b90e3a319d7

You can read more about DLRM in the original paper and the official codebase.

Comparing all models

The mean rank of the test set examples was computed using each model, and the following chart shows their comparison based on the best rank achieved on the test set. In short, for this experiment: AFM > FFM » DeepFM > DLRM_simplified > Wide & Deep > NFM > FM.

Comaring all models

Summary

In this article, we defined the need for modeling feature interactions and then looked at some of the most popular machine learning algorithms designed to estimate feature interactions under sparse settings. We implemented all of the algorithms in Python and compared their results on a toy dataset.

References


  1. Cheng et al. (2016). Wide & Deep Learning for Recommender Systems. 7-10. 10.1145/2988450.2988454. ↩︎

  2. https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset ↩︎