Building a Neural Collaborative Filtering Recommendation System
Objective
Develop a movie recommendation system by implementing Neural Collaborative Filtering (NCF) using deep learning techniques. This project focuses on building a recommendation model that predicts user preferences for movies based on historical interaction data. You will work with the MovieLens dataset to train and evaluate your model, gaining hands-on experience with deep learning frameworks and recommender system algorithms.
Learning Outcomes
By completing this project, you will:
- Understand Neural Collaborative Filtering and its application in recommendation systems.
- Implement a deep learning model using frameworks like TensorFlow or PyTorch.
- Handle and preprocess real-world datasets, preparing them for neural network training.
- Evaluate the model’s performance using appropriate metrics and validation techniques.
- Gain experience in hyperparameter tuning to optimize model performance.
- Understand the challenges and solutions associated with recommendation systems.
Prerequisites and Theoretical Foundations
1. Python Programming (Intermediate Level)
- Data Structures: Lists, dictionaries, sets, and tuples.
- Control Flow: Loops, conditionals, and functions.
- Object-Oriented Programming: Classes and inheritance.
- Libraries: Familiarity with Pandas, NumPy, and Matplotlib.
2. Basic Machine Learning Concepts
- Supervised Learning:
- Understanding of regression and classification.
- Neural Networks:
- Basic knowledge of neural network architectures, activation functions, and training processes.
- Evaluation Metrics:
- Understanding of accuracy, precision, recall, and loss functions.
3. Introduction to Recommender Systems
- Collaborative Filtering:
- User-based and item-based collaborative filtering.
- Implicit vs. Explicit Feedback:
- Understanding the difference and how to handle each type.
- Matrix Factorization:
- Basic concept of decomposing the interaction matrix.
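For reference, matrix factorization predicts each rating as the inner product of learned latent vectors, approximating the interaction matrix as a low-rank product:

$$R \approx P Q^\top, \qquad \hat{r}_{ui} = p_u^\top q_i$$

where $p_u$ and $q_i$ are the latent vectors of user $u$ and item $i$. NCF generalizes this idea by replacing the fixed inner product with a neural network that learns the interaction function, which is exactly what you will build in Step 3.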
Tools Required
- Programming Language: Python 3.7+
- Libraries and Frameworks:
- Pandas: Data manipulation (pip install pandas)
- NumPy: Numerical computations (pip install numpy)
- Matplotlib: Visualization (pip install matplotlib)
- Scikit-learn: Machine learning utilities (pip install scikit-learn)
- TensorFlow or PyTorch: Deep learning framework (pip install tensorflow or pip install torch)
- Dataset:
- MovieLens 100K or 1M: Download from the MovieLens website (https://grouplens.org/datasets/movielens/)
- Environment:
- Jupyter Notebook or an IDE like PyCharm or VSCode
Project Structure
neural_cf_recommender/
│
├── data/
│ └── movielens/
│ ├── ratings.csv
│ └── movies.csv
│
├── src/
│ ├── data_preprocessing.py
│ ├── model.py
│ ├── train.py
│ ├── evaluate.py
│ └── utils.py
│
└── notebooks/
└── neural_cf_recommender.ipynb
Steps and Tasks
1. Data Acquisition and Exploration
Tasks:
- Download the MovieLens Dataset:
- Choose the 100K or 1M dataset for manageability.
- Load Data into Pandas DataFrames:
- Read ratings.csv and movies.csv.
- Perform Exploratory Data Analysis (EDA):
- Understand the distribution of ratings.
- Analyze the number of unique users and movies.
Implementation:
import pandas as pd
import matplotlib.pyplot as plt
# Load data
ratings = pd.read_csv('data/movielens/ratings.csv')
movies = pd.read_csv('data/movielens/movies.csv')
# EDA
print(ratings.head())
print(f"Number of users: {ratings['userId'].nunique()}")
print(f"Number of movies: {ratings['movieId'].nunique()}")
# Rating distribution
plt.hist(ratings['rating'], bins=5)
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Rating Distribution')
plt.show()
2. Data Preprocessing
Tasks:
- Encode User IDs and Movie IDs:
- Convert raw IDs to contiguous integers starting from 0, as required for embedding lookups.
- Split Data into Training and Testing Sets:
- Use a split that prevents data leakage (see the leave-one-out sketch below).
- Prepare Data for Model Input:
- Create tensors or arrays suitable for the deep learning model.
Implementation:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# Encode user IDs and movie IDs
user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()
ratings['user'] = user_encoder.fit_transform(ratings['userId'])
ratings['movie'] = movie_encoder.fit_transform(ratings['movieId'])
num_users = ratings['user'].nunique()
num_movies = ratings['movie'].nunique()
print(f"Number of users: {num_users}, Number of movies: {num_movies}")
# Split data
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)
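A purely random split works as a first pass, but the NCF literature more commonly uses a leave-one-out split that holds out each user's most recent interaction for testing, which better matches how a deployed recommender is used. A minimal sketch, relying on the timestamp column that the MovieLens ratings files include:

# Leave-one-out split: hold out each user's most recent rating for testing
ratings_sorted = ratings.sort_values('timestamp')
test_data = ratings_sorted.groupby('user').tail(1)   # last interaction per user
train_data = ratings_sorted.drop(test_data.index)    # everything else stays in training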
3. Implementing the Neural Collaborative Filtering Model
Tasks:
- Define the Model Architecture:
- Create embedding layers for users and movies.
- Combine embeddings and pass through neural network layers.
- Choose Activation Functions and Output Layer:
- Use appropriate activation functions (e.g., ReLU, sigmoid).
- Compile the Model:
- Select loss function and optimizer.
Implementation (using PyTorch):
import torch
import torch.nn as nn
class NeuralCollaborativeFiltering(nn.Module):
    def __init__(self, num_users, num_items, embedding_size):
        super(NeuralCollaborativeFiltering, self).__init__()
        # Dense latent vector for every user and every item
        self.user_embedding = nn.Embedding(num_users, embedding_size)
        self.item_embedding = nn.Embedding(num_items, embedding_size)
        # MLP that learns the user-item interaction function
        self.fc_layers = nn.Sequential(
            nn.Linear(embedding_size * 2, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, user_indices, item_indices):
        # Look up embeddings and concatenate into one vector per (user, item) pair
        user_embed = self.user_embedding(user_indices)
        item_embed = self.item_embedding(item_indices)
        vector = torch.cat([user_embed, item_embed], dim=-1)
        logits = self.fc_layers(vector)
        # Sigmoid maps the score into (0, 1) -- an interaction probability
        rating = self.sigmoid(logits)
        return rating
# Instantiate the model
embedding_size = 20
model = NeuralCollaborativeFiltering(num_users, num_movies, embedding_size)
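Before writing the training loop, it is worth sanity-checking the untrained model on a small dummy batch; the indices below are arbitrary examples and only need to be valid encoded IDs:

# Sanity check: push a dummy batch of 4 (user, item) pairs through the model
dummy_users = torch.LongTensor([0, 1, 2, 3])
dummy_items = torch.LongTensor([0, 1, 2, 3])
with torch.no_grad():
    preds = model(dummy_users, dummy_items)
print(preds.shape)  # expected: torch.Size([4, 1]), with values in (0, 1)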
4. Preparing the Training Loop
Tasks:
- Define Loss Function and Optimizer:
- Use binary cross-entropy loss for implicit feedback.
- Implement Negative Sampling:
- Generate negative samples for training.
- Create DataLoader for Batching:
- Utilize PyTorch’s DataLoader for efficient data loading.
Implementation:
import numpy as np  # needed below for negative sampling
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Prepare the dataset with negative sampling
class NCFDataset(Dataset):
    def __init__(self, data, num_items, num_negatives=4):
        self.users = []
        self.items = []
        self.labels = []
        # All observed (user, item) pairs -- used to avoid sampling a
        # user's known positives as negatives (false negatives)
        user_item_set = set(zip(data['user'], data['movie']))
        for _, row in data.iterrows():
            u, i = int(row['user']), int(row['movie'])
            # Observed interaction -> positive example
            self.users.append(u)
            self.items.append(i)
            self.labels.append(1)
            # Sample unobserved items for this user as negatives
            for _ in range(num_negatives):
                negative_item = np.random.randint(num_items)
                while (u, negative_item) in user_item_set:
                    negative_item = np.random.randint(num_items)
                self.users.append(u)
                self.items.append(negative_item)
                self.labels.append(0)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.labels[idx]
# Create DataLoader
train_dataset = NCFDataset(train_data, num_movies)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
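To confirm the loader yields what the training loop expects, you can inspect a single batch; a quick check:

# Peek at one batch: users/items/labels should each have shape (256,)
users, items, labels = next(iter(train_loader))
print(users.shape, items.shape, labels.shape)
print(labels[:10])  # mixture of 1s (observed) and 0s (sampled negatives)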
5. Training the Model
Tasks:
- Implement the Training Loop:
- Iterate over epochs and batches.
- Perform forward and backward propagation.
- Monitor Training Loss:
- Print or log the loss at intervals.
Implementation:
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for user_batch, item_batch, label_batch in train_loader:
        # DataLoader already collates batches into tensors; just cast dtypes
        user_batch = user_batch.long()
        item_batch = item_batch.long()
        label_batch = label_batch.float()
        optimizer.zero_grad()
        outputs = model(user_batch, item_batch).squeeze()
        loss = criterion(outputs, label_batch)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
6. Evaluating the Model
Tasks:
- Prepare Test Data:
- Ensure no data leakage from training.
- Choose Evaluation Metrics:
- Use metrics like Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG).
- Implement Evaluation Function:
- Calculate metrics over the test set.
Implementation:
def evaluate_model(model, test_data, top_k=10):
    model.eval()
    hits, ndcgs = [], []
    with torch.no_grad():
        for _, row in test_data.iterrows():
            u, pos = int(row['user']), int(row['movie'])
            # Score every movie for this user
            all_items = torch.arange(num_movies)
            users = torch.LongTensor([u]).repeat(num_movies)
            outputs = model(users, all_items).squeeze()
            _, indices = torch.topk(outputs, top_k)
            recommended_items = indices.numpy()
            if pos in recommended_items:
                hits.append(1)
                # 1-based rank of the held-out item within the top-k list
                rank = np.where(recommended_items == pos)[0][0] + 1
                ndcgs.append(1 / np.log2(rank + 1))
            else:
                hits.append(0)
                ndcgs.append(0)
    hr = np.mean(hits)
    ndcg = np.mean(ndcgs)
    print(f"Hit Ratio @ {top_k}: {hr:.4f}, NDCG @ {top_k}: {ndcg:.4f}")

# Evaluate the model
evaluate_model(model, test_data)
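Scoring every movie for every test user is O(users × movies) and becomes slow on the 1M dataset. The original NCF paper instead ranks each held-out item against a random sample of items the user has not interacted with (commonly 99 negatives). A sketch of that protocol; train_pairs is a set of observed (user, movie) pairs that you build from the training data, as shown in the usage comment:

def evaluate_sampled(model, test_data, train_pairs, num_negatives=99, top_k=10):
    model.eval()
    hits, ndcgs = [], []
    with torch.no_grad():
        for _, row in test_data.iterrows():
            u, pos = int(row['user']), int(row['movie'])
            # Candidate list: the held-out positive plus sampled unseen items
            candidates = [pos]
            while len(candidates) < num_negatives + 1:
                j = np.random.randint(num_movies)
                if (u, j) not in train_pairs and j not in candidates:
                    candidates.append(j)
            items = torch.LongTensor(candidates)
            users = torch.LongTensor([u] * len(candidates))
            scores = model(users, items).squeeze()
            _, idx = torch.topk(scores, top_k)
            ranked = items[idx].numpy()
            if pos in ranked:
                rank = np.where(ranked == pos)[0][0] + 1
                hits.append(1)
                ndcgs.append(1 / np.log2(rank + 1))
            else:
                hits.append(0)
                ndcgs.append(0)
    print(f"Sampled HR@{top_k}: {np.mean(hits):.4f}, NDCG@{top_k}: {np.mean(ndcgs):.4f}")

# Usage:
# train_pairs = set(zip(train_data['user'], train_data['movie']))
# evaluate_sampled(model, test_data, train_pairs)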
7. Hyperparameter Tuning
Tasks:
- Experiment with Different Embedding Sizes:
- Try different sizes like 10, 20, 50.
- Adjust Learning Rate and Batch Size:
- Observe the impact on training convergence.
- Implement Early Stopping:
- Stop training when the validation loss stops improving (see the sketch below).
Implementation:
# Example of trying different embedding sizes
for embedding_size in [10, 20, 50]:
    model = NeuralCollaborativeFiltering(num_users, num_movies, embedding_size)
    # Repeat the training and evaluation steps
    # Compare the results to find the optimal embedding size
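Early stopping can be layered onto the Step 5 loop by tracking a validation metric and stopping once it fails to improve for a few consecutive epochs. A minimal sketch; compute_val_loss is a hypothetical helper you would write (e.g., average BCE loss over a held-out validation split):

best_val = float('inf')
patience, bad_epochs = 3, 0
for epoch in range(50):
    # ... run one training epoch exactly as in Step 5 ...
    val_loss = compute_val_loss(model)  # hypothetical validation helper
    if val_loss < best_val:
        best_val = val_loss
        bad_epochs = 0
        torch.save(model.state_dict(), 'best_model.pt')  # checkpoint the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break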
8. Conclusion and Next Steps
Tasks:
- Summarize Findings:
- Discuss model performance and observations.
- Identify Potential Improvements:
- Suggest methods like incorporating content-based features or using more advanced architectures.
Conclusion
In this project, you:
- Developed a neural collaborative filtering model using deep learning techniques.
- Preprocessed and prepared data from the MovieLens dataset.
- Implemented a training loop with negative sampling to handle implicit feedback.
- Evaluated the model’s performance using appropriate metrics.
- Experimented with hyperparameters to optimize the model.
This project provides a solid foundation in building deep learning-based recommendation systems. You can extend this work by:
- Incorporating Content-Based Features:
- Integrate movie metadata like genres or descriptions to improve recommendations.
- Exploring Advanced Architectures:
- Implement models like Autoencoders or Graph Neural Networks.
- Deploying the Model:
- Build an API to serve recommendations in real-time.
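As a first taste of deployment, the trained model can be wrapped in a tiny web endpoint. The sketch below uses Flask purely as an example; the route name, port, and top_k default are arbitrary choices, and the model and encoders are assumed to be the objects created earlier in the project:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/recommend/<int:user_id>')
def recommend(user_id, top_k=10):
    # Map the raw MovieLens user ID to the encoded index used by the model
    user = torch.LongTensor(user_encoder.transform([user_id]))
    items = torch.arange(num_movies)
    with torch.no_grad():
        scores = model(user.repeat(num_movies), items).squeeze()
    top_items = torch.topk(scores, top_k).indices.numpy()
    # Map encoded indices back to raw MovieLens movie IDs
    movie_ids = movie_encoder.inverse_transform(top_items)
    return jsonify({'user': user_id, 'recommendations': movie_ids.tolist()})

# Run with: app.run(port=5000)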