Vedant Tewari Self-Assessment Summer 2023

Project Title:

AI Code Assistants

Subtitle:

Dimensionality Reduction and Text Classification Approaches in Group Sentiment Analysis

Pathway:

Machine Learning Models

Mentor Chains® Role:

Mentor Chains® Participant

Goals for the Internship:

  1. Acquiring a foundational comprehension of machine learning and the deployment of ML models.
  2. Enhancing collaborative skills within a team to collectively pursue objectives, thereby broadening one’s scope and fostering learning opportunities.
  3. Building practical experience in both machine learning and teamwork.
  4. Diving into advanced data augmentation methods, such as generative adversarial networks (GANs) or self-supervised learning, to push the boundaries of model performance.
  5. Gaining proficiency in hyperparameter tuning techniques, optimizing model parameters to achieve peak performance and fine-tuning models for specific tasks and datasets.

Achievement Highlights:

  1. Feature Extraction with TF-IDF: Successfully applied Term Frequency-Inverse Document Frequency (TF-IDF) to convert textual data into numerical features.
  2. Dimensionality Reduction Techniques: Utilized dimensionality reduction techniques such as t-distributed Stochastic Neighbor Embedding (t-SNE), Isomap, and Linear Discriminant Analysis (LDA) to reduce the high-dimensional feature space into lower dimensions.
  3. Cluster Analysis with K-Means: Applied K-Means clustering to group similar data points into clusters based on the reduced feature representations.
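
A minimal sketch of how these three highlights fit together, assuming a small pandas DataFrame with a Question column (the names, parameters, and toy data below are illustrative, not the exact project code):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Toy stand-in data; the project used forum posts collected via web scraping.
df = pd.DataFrame({"Question": [
    "How do I install Python?",
    "What is overfitting?",
    "How can I speed up pandas?",
    "Why does my loss become NaN?",
]})

# 1. Feature extraction: turn raw text into TF-IDF vectors.
X = TfidfVectorizer(stop_words="english").fit_transform(df["Question"])

# 2. Dimensionality reduction: project the sparse TF-IDF space down to 2-D.
#    t-SNE is shown; Isomap could be swapped in, and LDA would also need labels.
X_2d = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(X.toarray())

# 3. Cluster analysis: group similar questions in the reduced space.
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_2d)
print(df)
```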

Challenges Faced:

  1. Model Selection Challenge: Initially, I encountered difficulties in selecting the right machine learning models for my task. Some models, like VAE, posed issues with feature selection, resulting in loss values turning to NaN. Additionally, SVD and Feature Selection methods didn’t provide conclusive outcomes. This showcases the importance of thoughtful model selection to ensure effective learning and feature representation.
  2. Inconsistent Predictions: Another challenge I grappled with was the inconsistency in predictions from some of the models. This inconsistency indicated that model performance varied across different runs, potentially due to variations in data quality or quantity (one way to quantify this run-to-run variability is sketched after this list).
  3. Data Volume Requirement: A recurring challenge was the necessity for a substantial volume of data. Specifically, in NLP tasks, it became evident that having an initial dataset with a significant number of samples, exceeding 5,000, was critical to achieving higher accuracy levels, typically in the 70s or 80s.
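
To make the second challenge concrete, the sketch below shows one hedged way to quantify how much clustering output drifts between runs, by comparing K-Means labelings obtained with different random seeds (an illustrative diagnostic, not the project's actual code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def run_to_run_agreement(X, n_clusters=10, seeds=(0, 1, 2, 3, 4)):
    """Cluster the same feature matrix with different seeds and report how
    similar the resulting labelings are (1.0 means identical clusterings)."""
    labelings = [
        KMeans(n_clusters=n_clusters, n_init=10, random_state=s).fit_predict(X)
        for s in seeds
    ]
    pairwise = [
        adjusted_rand_score(labelings[i], labelings[j])
        for i in range(len(labelings))
        for j in range(i + 1, len(labelings))
    ]
    return float(np.mean(pairwise))

# X would be a TF-IDF matrix; values near 1.0 indicate stable clusters, while
# low values reflect the kind of inconsistency described above.
```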

Detailed Statement:

I primarily focused on Dimensionality Reduction for Group Sentiment Analysis using various techniques, including TF-IDF, Isomap, GDA (Generalized Discriminant Analysis), and K-Means clustering, with particular emphasis on Principal Component Analysis. I encountered several challenges during the project, such as issues with data volume and quality, the choice of dimensionality reduction techniques, and handling noisy data. Despite these challenges, I gained valuable skills in web scraping, data preprocessing, text analysis, and teamwork.

Data Pre-Processing

1. Text Preprocessing: Tokenization, the process of splitting text into individual words or tokens, was applied.
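
A small, self-contained illustration of this step is shown below; the regex tokenizer here is only a stand-in for whichever library tokenizer the project actually used:

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens (a simple regex tokenizer;
    a library tokenizer such as NLTK's could be used instead)."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("AI code assistants are improving quickly, aren't they?"))
# ['ai', 'code', 'assistants', 'are', 'improving', 'quickly', "aren't", 'they']
```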

2. Sentiment Analysis Preparation: To facilitate sentiment analysis, the data was prepared according to the requirements of the sentiment analysis tools. For instance, VADER sentiment analysis required the data to be split into sentences. This step ensured that the data was ready for both qualitative and sentiment-related analysis (as mentioned, the sentiment analysis itself was part of the work done by the other group members).
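
A hedged sketch of that sentence-level preparation, using NLTK's sentence tokenizer together with its VADER analyzer (the real pipeline belonged to other group members, so the example text and setup here are illustrative):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)          # sentence tokenizer models
nltk.download("vader_lexicon", quiet=True)  # VADER's sentiment lexicon

analyzer = SentimentIntensityAnalyzer()
answer = "The assistant was very helpful. The setup, however, was frustrating."

# VADER works on sentence-sized units, so split each answer into sentences first.
for sentence in sent_tokenize(answer):
    scores = analyzer.polarity_scores(sentence)
    print(sentence, scores["compound"])
```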

3. Dealing with Missing Data: I implemented strategies to address missing values, deciding whether to exclude such data points or use imputation methods. This step was crucial for maintaining data integrity. First, I identified the columns that contained information about the top replies and their corresponding upvotes; it was crucial to work with the correct column names to ensure the accuracy of the data manipulation (steps 3–5 are sketched in code after this list).

4. Remove Columns: Using the drop method in pandas, we removed the columns related to top replies and upvotes from the DataFrame. This step effectively eliminated unnecessary data, making the dataset more concise and focused on the attributes of interest.

5. Data Quality Assurance: Inconsistent or anomalous data points were identified and addressed, enhancing the reliability of our subsequent analyses.
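
A short pandas sketch of steps 3–5 above; the file name and the top-reply/upvote column names are placeholders, since the real scraped dataset used its own names:

```python
import pandas as pd

df = pd.read_csv("scraped_posts.csv")  # placeholder file name

# Step 3: handle missing data -- fill missing text fields with empty strings
# rather than dropping the rows outright.
df["Question"] = df["Question"].fillna("")
df["Answer"] = df["Answer"].fillna("")

# Step 4: remove the columns about top replies and their upvotes
# ("TopReply" / "TopReplyUpvotes" are illustrative column names).
df = df.drop(columns=["TopReply", "TopReplyUpvotes"], errors="ignore")

# Step 5: basic quality assurance -- drop exact duplicates and empty rows.
df = df.drop_duplicates()
df = df[(df["Question"].str.strip() != "") | (df["Answer"].str.strip() != "")]

df.to_csv("cleaned_posts.csv", index=False)
```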

Modeling

How the Code Worked (Example of Dimensionality Reduction with PCA):

  1. Importing Libraries: The necessary libraries are imported, including NumPy, Matplotlib, Seaborn, scikit-learn’s TfidfVectorizer and KMeans, and Pandas.
  2. Reading Data: The code reads data from several different input CSV files (different datasets that were collected from web scraping).
  3. Data Preprocessing: The NaN values in the ‘Question’ and ‘Answer’ columns of the dataset are replaced with empty strings using the fillna method.
  4. Extracting Text Data: The ‘Question’ and ‘Answer’ columns are extracted and converted to lists.
  5. TF-IDF Vectorization: The TF-IDF vectorizer is used to convert the text data into TF-IDF matrices for both questions and answers. This step transforms the text data into numerical vectors that can be used for clustering.
  6. K-Means Clustering: K-Means clustering is performed on the TF-IDF matrices separately for both questions and answers. The KMeans class from scikit-learn is used for clustering. The specified number of clusters is set to 10.
  7. Cluster Tags Definition: Meaningful cluster tags are defined for both questions and answers. These tags provide human-interpretable labels for each cluster.
  8. Assigning Cluster Tags: The cluster labels obtained from K-Means clustering for both questions and answers are mapped to the defined cluster tags. This step assigns each data point to a specific cluster based on its cluster label.
  9. Adding Cluster Tags to DataFrame: The assigned cluster tags for both questions and answers are added as new columns to the DataFrame.
  10. Saving Results: The modified DataFrame with cluster tags is saved to a new CSV file.
  11. Printing Sample Data: A sample of the modified DataFrame is printed using the head method to verify the results.
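
A condensed sketch reconstructing the eleven steps above; the file name and cluster tags are placeholders (the real tags were chosen by inspecting the clusters), while the 'Question' and 'Answer' column names match the description:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Steps 1-2: imports and reading one of the scraped CSV files (placeholder name).
df = pd.read_csv("forum_dataset.csv")

# Step 3: replace NaN text with empty strings.
df["Question"] = df["Question"].fillna("")
df["Answer"] = df["Answer"].fillna("")

# Step 4: extract the text columns as lists.
questions = df["Question"].tolist()
answers = df["Answer"].tolist()

# Step 5: TF-IDF vectorization, separately for questions and answers.
q_matrix = TfidfVectorizer(stop_words="english").fit_transform(questions)
a_matrix = TfidfVectorizer(stop_words="english").fit_transform(answers)

# Step 6: K-Means clustering with 10 clusters on each matrix.
q_labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(q_matrix)
a_labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(a_matrix)

# Steps 7-8: map numeric cluster labels to human-readable tags (placeholders here).
q_tags = {i: f"question_topic_{i}" for i in range(10)}
a_tags = {i: f"answer_topic_{i}" for i in range(10)}

# Step 9: add the assigned tags as new DataFrame columns.
df["QuestionCluster"] = [q_tags[label] for label in q_labels]
df["AnswerCluster"] = [a_tags[label] for label in a_labels]

# Steps 10-11: save the result and print a sample to verify it.
df.to_csv("forum_dataset_with_clusters.csv", index=False)
print(df.head())
```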

One of the significant achievements was the successful application of dimensionality reduction techniques like Isomap and GDA to transform high-dimensional textual data into lower-dimensional representations for clustering. Additionally, I used K-Means clustering to categorize questions and answers into meaningful clusters, contributing to the overall organization of the data.
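
For the nonlinear reduction mentioned here, a minimal Isomap sketch on TF-IDF features might look as follows (GDA would additionally require sentiment labels, so only Isomap is shown, with toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import Isomap
from sklearn.cluster import KMeans

texts = ["great assistant", "terrible latency", "helpful answers", "slow and buggy"]

# TF-IDF features, densified so Isomap can compute neighborhood distances.
X = TfidfVectorizer().fit_transform(texts).toarray()

# Project onto a low-dimensional manifold, then cluster in the reduced space.
embedding = Isomap(n_neighbors=2, n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```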

Furthermore, I built several models utilizing various approaches and considerations for dimensionality reduction and data preprocessing, such as attention mechanisms, data augmentation, semi-supervised learning, and regularization techniques.

Brief Overview of Alternative Approaches to PCA for Dimensionality Reduction:

  1. UMAP (Uniform Manifold Approximation and Projection) : UMAP is another nonlinear dimensionality reduction technique used for visualizing and clustering high-dimensional data.
  2. Truncated Singular Value Decomposition (Truncated SVD): Truncated SVD is a method similar to PCA that is specifically designed for sparse matrices. It approximates the matrix using a specified number of singular values and vectors, making it possible to reduce dimensionality while preserving the structure of the data.
  3. t-distributed Stochastic Neighbor Embedding (t-SNE): Unlike PCA, which is a linear technique, t-SNE is a nonlinear dimensionality reduction technique that is particularly useful for visualizing high-dimensional data in lower dimensions.
  4. Random Projections: Random projection techniques involve projecting high-dimensional data onto a lower-dimensional space using random matrices. Despite their simplicity, these techniques can sometimes work surprisingly well for dimensionality reduction.
  5. Non-negative Matrix Factorization (NMF): NMF decomposes a non-negative matrix into the product of two non-negative matrices, which makes it a promising tool in fields where only non-negative signals exist. Unlike PCA, NMF does not subtract the mean of the matrices, avoiding the unphysical negative values that mean-centering would introduce; as a result, NMF can preserve more information than PCA.
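
A hedged sketch of how a few of these alternatives plug into the same TF-IDF pipeline (toy data; UMAP needs the separate umap-learn package, so it is left as a comment):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.random_projection import GaussianRandomProjection
from sklearn.manifold import TSNE

texts = [
    "the model overfits on small data",
    "great tutorial on clustering",
    "loss went to nan after one epoch",
    "how to tune hyperparameters for kmeans",
    "dimensionality reduction before clustering helps",
]
X = TfidfVectorizer().fit_transform(texts)  # sparse TF-IDF matrix

svd_2d = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)        # works directly on sparse input
nmf_2d = NMF(n_components=2, init="nndsvd", random_state=0).fit_transform(X)  # non-negative factors
rp_2d = GaussianRandomProjection(n_components=2, random_state=0).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X.toarray())

# import umap
# umap_2d = umap.UMAP(n_components=2).fit_transform(X)  # requires umap-learn

print(svd_2d.shape, nmf_2d.shape, rp_2d.shape, tsne_2d.shape)
```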

Alternative Choices for Network Architecture:

  1. Regularization Techniques : Regularization techniques like L1 or L2 regularization can be applied to the autoencoder’s loss function to prevent overfitting. Regularization helps to control the complexity of the learned representations and can lead to more robust features.
  2. Denoising Autoencoders : Denoising autoencoders are trained to reconstruct input data from noisy versions of themselves. This can help the autoencoder learn more robust and meaningful features by forcing it to disregard noise.
  3. Variational Autoencoders (VAEs) : VAEs are a type of autoencoder that learns a probabilistic mapping between the input data and a latent space. They allow for generating new data samples and can be useful for data generation tasks in addition to dimensionality reduction.
  4. Attention Mechanisms : Introducing attention mechanisms within the autoencoder can allow it to focus on different parts of the input data, potentially capturing more relevant features.
  5. Semi-Supervised Learning: Incorporating a small amount of labeled data can help the autoencoder learn more meaningful features that are aligned with the labels.
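
As one concrete example from this list, the sketch below shows a minimal denoising autoencoder in PyTorch with L2 weight decay (items 1 and 2 above); the layer sizes, noise level, and training loop are illustrative rather than the configurations actually used in the project:

```python
import torch
from torch import nn

class DenoisingAutoencoder(nn.Module):
    """Reconstruct clean feature vectors (e.g., TF-IDF rows) from noisy inputs."""

    def __init__(self, input_dim=1000, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
# weight_decay acts as L2 regularization on the autoencoder's parameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = nn.MSELoss()

clean = torch.rand(64, 1000)                    # stand-in batch of feature vectors
noisy = clean + 0.1 * torch.randn_like(clean)   # corrupt inputs, keep clean targets

for _ in range(5):                              # a few illustrative training steps
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)
    loss.backward()
    optimizer.step()
```

The same skeleton extends naturally to the other items: a VAE would replace the deterministic latent code with a learned mean and variance, and attention layers or label-aware losses could be added in the encoder.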

Lastly, I believe this project allowed me to apply a diverse set of technical skills in natural language processing (NLP) and data analysis. It also enhanced my soft skills, including communication and collaboration. The experience gained in handling large-scale data and applying machine learning techniques will undoubtedly be valuable for future projects and endeavors.
