🔸 Code Along for F-IE-1: Text Classification using NLP

Task: Create, Compare, and Visualize Movie Review Sentiment Analyzers

Welcome to an exciting exploration of Natural Language Processing (NLP)! In this task, you’ll create a sentiment analysis model to determine whether movie reviews are positive or negative. You’ll then analyze its performance and create compelling visualizations to showcase your results.

Please share your work by replying to this post with screenshots of your visualizations. Then, comment on at least two other submissions. Your active participation, analysis, visualizations, and peer feedback will count as your evaluation for the Virtual-Internships. Alternatively, you may opt to discuss your work in the AI Evaluator session, where your understanding of and insights into the Text Classification Project will be assessed.

Task Overview

  1. Build a Sentiment Analyzer: Create a model using a 200-review sample of the IMDB movie review dataset (the starter code draws this sample for you). Experiment with different feature extraction methods (e.g., CountVectorizer, TfidfVectorizer) and choose the best-performing one.

  2. Enhance the Model: Improve your model’s accuracy by implementing at least two text preprocessing techniques (e.g., lowercase conversion, punctuation removal, stopword removal, stemming/lemmatization). Briefly explain your choice of preprocessing steps in your visualizations.

  3. Test, Compare, and Visualize: Use your model to analyze real movie reviews. Evaluate its performance using metrics such as accuracy, precision, recall, F1-score, and ROC curve with AUC score. Create visualizations that highlight these metrics and provide insights into your model’s strengths and weaknesses.

  4. Advanced Analysis: Develop at least one custom visualization that provides unique insights into your model’s performance or the data itself. This could be an analysis of feature importance, error patterns, or any other aspect you find interesting.

  5. Peer Review: Engage with your peers by providing constructive feedback on at least two other submissions. Discuss the effectiveness of preprocessing, the creativity and informativeness of visualizations, and the thoroughness of the analysis. Suggest potential improvements and share insights gained that could inform future work.
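
As a rough illustration of step 2, the sketch below combines three of the preprocessing techniques listed there: lowercase conversion, punctuation removal, and stopword removal. It is only one possible approach, not the required solution. The `STOPWORDS` set is a deliberately tiny, hand-picked list invented for this example, and the `<br />` handling assumes the raw IMDB reviews still contain HTML line-break tags.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real run would likely use a fuller list (e.g. from scikit-learn or NLTK).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "this", "was", "in"}

# Translation table that maps every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_text(text):
    """Lowercase, strip HTML line breaks and punctuation, and drop stopwords."""
    text = text.lower()
    # Replace literal "<br />" tags with spaces before removing punctuation
    text = re.sub(r"<br\s*/?>", " ", text)
    text = text.translate(PUNCT_TO_SPACE)
    # Split on whitespace and keep only tokens that are not stopwords
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(preprocess_text("This movie was GREAT!<br />Loved it."))
# → movie great loved
```

Replacing punctuation with spaces rather than deleting it keeps words on either side of a tag or dash from fusing together, at the cost of splitting contractions such as "don't" into two tokens.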

Basic Python Code for Sentiment Analysis:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc

# Load the IMDB dataset
from sklearn.datasets import load_files

# Note: Before running this script, download the IMDB dataset from 
# http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# Extract the contents and ensure you have a folder named 'aclImdb' in your working directory

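# Load all positive and negative reviews from the labeled training split
# (the archive unpacks into aclImdb/train and aclImdb/test, each with pos/ and neg/ subfolders)
dataset = load_files("./aclImdb/train", categories=['pos', 'neg'], shuffle=True, random_state=42)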

# Convert to DataFrame for easier handling
df = pd.DataFrame({'review': dataset.data, 'sentiment': dataset.target})

# Convert reviews to string (they're initially bytes)
df['review'] = df['review'].apply(lambda x: x.decode('utf-8'))

# Limit to 200 reviews for this task
df = df.sample(n=200, random_state=42)

reviews = df['review'].tolist()
sentiments = df['sentiment'].tolist()

print(f"Dataset loaded. Number of reviews: {len(reviews)}")
print(f"Sample review: {reviews[0][:100]}...")  # Print first 100 characters of first review

def preprocess_text(text):
    # TODO: Implement at least two text preprocessing techniques
    # Examples: lowercase conversion, punctuation removal, stopword removal, stemming/lemmatization
    return text  # Placeholder: returns the text unchanged until the steps above are implemented

# Apply preprocessing
preprocessed_reviews = [preprocess_text(review) for review in reviews]

def extract_features(reviews):
    # TODO: Experiment with different feature extraction methods (e.g., CountVectorizer, TfidfVectorizer)
    # Choose the best performing one
    pass  # Placeholder so the script parses; replace with your implementation

# TODO: Extract features from preprocessed reviews
# X = ...
# y = ...

# TODO: Split the data into training and testing sets
# X_train, X_test, y_train, y_test = ...

def train_model(X_train, y_train):
    # TODO: Train your chosen model
    pass  # Placeholder so the script parses; replace with your implementation

# TODO: Train the model
# model = ...

def evaluate_model(model, X_test, y_test):
    # TODO: Evaluate the model using accuracy, precision, recall, F1-score, and ROC curve with AUC score
    pass  # Placeholder so the script parses; replace with your implementation

# TODO: Evaluate the model
# metrics = ...

def visualize_model_performance(metrics):
    # TODO: Create visualizations that highlight the model's performance metrics
    pass  # Placeholder so the script parses; replace with your implementation

def visualize_feature_importance(model, feature_names):
    # TODO: Create a visualization of feature importance
    pass  # Placeholder so the script parses; replace with your implementation

def custom_visualization(model, X, y):
    # TODO: Develop at least one unique visualization that provides deeper insights into your model or the data
    # Ideas: heatmap of word correlations, prediction confidence vs. review length, analysis of misclassified reviews
    pass  # Placeholder so the script parses; replace with your implementation

# Main execution
if __name__ == "__main__":
    # TODO: Run all the necessary steps and create visualizations
    # Remember to save or display all visualizations for submission
    pass  # Placeholder so the script parses; replace with your pipeline calls
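
For the feature extraction TODO in the skeleton above, one possible way to compare CountVectorizer and TfidfVectorizer is to cross-validate the same classifier with each. The sketch below is a hedged illustration rather than the expected solution: the helper name `compare_vectorizers`, the toy reviews, and the choice of 2-fold cross-validation are all assumptions made for this example. In your own run you would pass in `preprocessed_reviews` and `sentiments` and probably use more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data, invented for illustration (1 = positive, 0 = negative)
toy_reviews = [
    "a wonderful heartfelt film", "brilliant acting and a great story",
    "i loved every minute", "a great and moving experience",
    "a dull boring mess", "terrible acting and a weak story",
    "i hated every minute", "a boring and painful experience",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]


def compare_vectorizers(texts, labels, cv=2):
    """Return the mean cross-validated accuracy for each vectorizer."""
    candidates = {
        "CountVectorizer": CountVectorizer(),
        "TfidfVectorizer": TfidfVectorizer(),
    }
    scores = {}
    for name, vectorizer in candidates.items():
        # Fitting the vectorizer inside the pipeline means each fold builds
        # its vocabulary from that fold's training portion only
        pipeline = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
        fold_scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores


results = compare_vectorizers(toy_reviews, toy_labels)
for name, score in results.items():
    print(f"{name}: mean accuracy = {score:.2f}")
```

The resulting dictionary maps directly onto a bar chart, which is one simple way to produce the comparison visualization the task asks for. With only 200 reviews, differences between the two vectorizers may be small and sensitive to the random split, so treat a single comparison with some caution.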

Instructions for Students:

  • Setup: Install Jupyter Notebook and the required libraries: numpy, pandas, matplotlib, seaborn, and scikit-learn.

  • Preprocessing: Implement at least two text preprocessing techniques. Visualize the impact of these techniques on your model’s performance.

  • Feature Extraction: Experiment with different feature extraction methods. Create a visualization comparing their effectiveness.

  • Model Evaluation: Calculate and visualize accuracy, precision, recall, F1-score, and the ROC curve with AUC score. Interpret these metrics in the context of sentiment analysis.

  • Custom Visualization: Develop at least one unique visualization that provides deeper insights into your model or the data. This could be:

    • A heatmap showing the correlation between certain words and sentiment
    • A visualization of how prediction confidence varies with review length
    • An analysis of misclassified reviews and common patterns among them
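
To help with interpreting the metrics named in the Model Evaluation step, the sketch below computes accuracy, precision, recall, and F1-score by hand from confusion-matrix counts. This is only meant to make the formulas concrete. In the assignment itself, scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` would normally do this work. The function name `classification_metrics` and the counts are hypothetical, not results from the IMDB data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the reviews predicted positive, how many really were positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly positive reviews, how many the model found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 40-review test set (positive sentiment = class 1)
metrics = classification_metrics(tp=16, fp=4, fn=6, tn=14)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# → accuracy: 0.750
# → precision: 0.800
# → recall: 0.727
# → f1: 0.762
```

Read in sentiment terms, these made-up numbers would mean that 80% of the reviews the model labeled positive really were positive, while it found only about 73% of all positive reviews. A gap like that suggests the model leans toward predicting "negative" when unsure.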

Sharing and Discussion

Share your results by posting screenshots of your Jupyter Notebook visualizations, including one or more of:

  • Model performance metrics and ROC curve
  • Impact of preprocessing techniques
  • Comparison of feature extraction methods
  • Your custom visualization
  • Analysis of real movie reviews

Let your visualizations tell the full story without additional written explanations!

Peer Review

After submitting your own work, review and comment on at least two other students’ submissions. In your comments, consider the following:

  1. What do you find interesting or innovative about their approach or visualizations?
  2. How do their results compare to yours? Are there any notable differences in model performance or insights?
  3. Based on their visualizations, can you suggest any potential improvements or areas for further exploration?
  4. Is there anything you learned from their submission that you might apply to your own work in the future?

Remember to be constructive and respectful in your feedback. The goal is to learn from each other and foster a collaborative learning environment.