Project Title:
Emerging Horizons in Coding: A Multifaceted Investigation into AI Code Assistants
Subtitle:
It was an intense learning journey, from the complexities of web scraping to the nuances of recommender systems, along with rigorous collaborative teamwork. I would like to thank my teammates, and especially my lead, @Fay_Elhassan.
Pathway:
Machine Learning
Mentor Chains® Role:
Project Management Lead. Member of the Machine Learning subteam focusing on Article Recommendations
Goals for the Internship:
- Investigate the capabilities and potential of AI code assistants.
- Scrape AI code assistant-related data from Medium.
- Develop and refine a recommender system for articles, posts, and videos, with a focus on Medium.
- Set up the Softr site for the team, leveraging prior experience with no-code platforms.
- Implement an Airtable-based project management tool to streamline team operations.
Achievement Highlights:
- Successfully gathered data from Medium, navigating website structure changes and the absence of clear tags.
- Developed two recommendation methods using Word2Vec and BERT embeddings combined with cosine similarity.
- Set up the Softr site, drawing on previous experience with platforms like Wix.
- Established an Airtable-based project management system for organized team collaborations.
- Refined text data by aggressively removing common words to improve recommendation quality.
- Utilized computed columns like vector magnitude, average similarity score, and the number of highly similar articles to validate the recommender system’s effectiveness.
Challenges Faced:
- Encountering website structure changes and the absence of clear tags while scraping Medium.
- Initial recommendations from the recommender system were too similar, requiring extensive refinement.
- Conceptual challenges related to the newness and complexity of the recommender system.
- Multitasking across several parallel responsibilities (scraping, recommender development, and project management tooling).
Detailed Statement:
Emerging Horizons in Coding: A Multifaceted Investigation into AI Code Assistants
Introduction:
Our team is investigating the capabilities and potential of AI code assistants. I am a member of the machine learning subteam, along with Brian and Krishna, focusing on developing a recommender system for articles, posts and videos from various platforms.
As part of this initiative, I am responsible for gathering data from Medium and testing our recommender system on these articles. This phase serves as a stepping stone for refining our system to handle data from multiple sources.
The larger goal of our subteam is to create a comprehensive recommender system that integrates with: (1) a Tableau-based Dashboard, (2) a Chatbot, and (3) an Online Resource Website.
Our work aims to deliver tailored article recommendations across multiple platforms, enhancing the experience for users interested in AI code assistants.
Data Source:
The data used in this phase consists of Medium articles obtained with this script I developed: https://github.com/anya-chauhan/bytemasters/blob/main/medium_articles_scraper.py
Articles in the dataset include the following attributes: keyword, source, URL, title, subtitle, summary, reactions, member-only status and date.
Goal:
The primary goal of this project is to create a recommender system for the Medium articles using two different methods: Word2Vec embeddings combined with cosine similarity, and BERT embeddings combined with cosine similarity. By generating recommendations based on the textual content of the articles (title, subtitle, and summary), we aim to help users discover articles that are relevant and interesting to them. Recommender code is at https://github.com/anya-chauhan/bytemasters/blob/main/recommender.ipynb. Direct colab link: recommender.ipynb
Challenges:
One of the major challenges faced in this project is verifying the effectiveness of the recommender system since we haven’t yet done user testing. To overcome this, we decided to rely on the following computed columns to provide insights into the quality of the recommendations:
Vector Magnitude: This column represents the magnitude (or length) of the vector representation of the article text. It can provide an indication of the richness or diversity of the content in the article.
Average Similarity Score: This column represents the average similarity score of an article with all other articles in the dataset. It is calculated by taking the mean value of the corresponding row in the similarity matrix. Higher values indicate that the article is more similar to other articles in the dataset on average.
Number of Highly Similar Articles: This column represents the number of articles that are highly similar to the given article, based on a specified similarity threshold. It is calculated by counting the number of similarity scores in the corresponding row of the similarity matrix that exceed the threshold (excluding the article itself).
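To make these computed columns concrete, here is a minimal sketch, assuming a precomputed article-by-article cosine-similarity matrix and one embedding vector per article; the names `embeddings`, `similarity`, and the 0.8 threshold are illustrative and not taken from the project notebook:

```python
# Minimal sketch of the three validation columns described above.
# Assumes `embeddings` is an (n_articles x dim) array and `similarity`
# an (n_articles x n_articles) cosine-similarity matrix; the threshold
# value is illustrative.
import numpy as np

def validation_columns(embeddings, similarity, threshold=0.8):
    # Vector magnitude: L2 norm of each article's embedding.
    magnitude = np.linalg.norm(embeddings, axis=1)
    # Average similarity: mean of each row of the similarity matrix.
    avg_similarity = similarity.mean(axis=1)
    # Highly similar articles: scores above the threshold, excluding the article itself.
    above = similarity > threshold
    np.fill_diagonal(above, False)
    n_highly_similar = above.sum(axis=1)
    return magnitude, avg_similarity, n_highly_similar
```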
Approach:
Preprocessing: Initially, the text data was preprocessed by tokenizing, removing stopwords and punctuation, and stemming the words.
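As a rough sketch of this preprocessing step (NLTK is an assumption here; the exact pipeline in the project notebook may differ):

```python
# Sketch of the preprocessing pipeline: lowercase, tokenize, drop
# stopwords and punctuation, then stem.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS and t not in string.punctuation]
    return [STEMMER.stem(t) for t in tokens]

# Usage (illustrative): combine title, subtitle, and summary per article.
# article_tokens = [preprocess(f"{a['title']} {a['subtitle']} {a['summary']}") for a in articles]
```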
Common Word Removal: However, the initial results were not satisfactory: most of the recommendations were very similar, with little differentiation among the articles. To improve the quality of the text data, we experimented with aggressively removing common words. We started by removing words that appeared in more than 30% of the articles, but this threshold did not yield meaningful results with the Word2Vec embeddings. We therefore progressively lowered the threshold and observed the impact on the data, finding that going below 10% produced more meaningful results. This allowed us to focus on the more unique and relevant words in each article, which led to more meaningful comparisons and recommendations.
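The document-frequency filter, plus one common way of turning word vectors into article vectors (averaging), can be sketched as follows; the parameter values and the gensim-based averaging are illustrative assumptions, not copied from the notebook:

```python
# Sketch of the common-word filter plus averaged Word2Vec article embeddings.
# `article_tokens` is assumed to be a list of token lists (one per article),
# e.g. produced by preprocess() above.
from collections import Counter

import numpy as np
from gensim.models import Word2Vec

def remove_common_words(article_tokens, max_doc_freq=0.10):
    n_docs = len(article_tokens)
    doc_freq = Counter()
    for tokens in article_tokens:
        doc_freq.update(set(tokens))  # count each word at most once per article
    common = {w for w, c in doc_freq.items() if c / n_docs > max_doc_freq}
    return [[t for t in tokens if t not in common] for tokens in article_tokens]

def word2vec_article_vectors(article_tokens, vector_size=100):
    # Train Word2Vec on the token lists, then average word vectors per article.
    w2v = Word2Vec(sentences=article_tokens, vector_size=vector_size,
                   window=5, min_count=1, workers=4)
    return np.array([
        np.mean([w2v.wv[t] for t in tokens], axis=0) if tokens else np.zeros(vector_size)
        for tokens in article_tokens
    ])

# Usage (illustrative):
# filtered = remove_common_words(article_tokens, max_doc_freq=0.10)
# w2v_vectors = word2vec_article_vectors(filtered)
```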
BERT Method: For the BERT method, we observed that even without the removal of common words, the results were meaningful. However, upon inspecting the common words, we determined that they were not contributing to the quality of the recommendations. Therefore, we decided to remove common words for the BERT method as well.
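For the BERT side, one common way to obtain article embeddings is via the sentence-transformers library; this is a sketch under that assumption, and the model actually used in the notebook may differ:

```python
# Sketch of BERT-based article embeddings using sentence-transformers.
# The model name is an assumption; any BERT-family sentence encoder works similarly.
from sentence_transformers import SentenceTransformer

def bert_article_vectors(texts, model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    # Returns an (n_articles x dim) numpy array of embeddings.
    return model.encode(texts, show_progress_bar=False)

# Usage (illustrative):
# texts = [" ".join(tokens) for tokens in filtered]
# bert_vectors = bert_article_vectors(texts)
```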
Cosine Similarity: We used cosine similarity to measure the similarity between the vector embeddings of the articles and generate recommendations.
Recommendation Function: We implemented a recommendation function that provides the top 5 most similar articles to a given article based on cosine similarity.
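Putting the last two steps together, a minimal sketch (variable names are illustrative; `article_vectors` is whichever embedding matrix, Word2Vec or BERT, is being evaluated):

```python
# Sketch of the similarity matrix and a top-5 recommendation function.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_similarity(article_vectors):
    # (n_articles x n_articles) matrix of pairwise cosine similarities.
    return cosine_similarity(article_vectors)

def recommend(similarity, article_index, titles, top_n=5):
    scores = similarity[article_index].copy()
    scores[article_index] = -np.inf  # exclude the article itself
    top_idx = np.argsort(scores)[::-1][:top_n]
    return [(titles[i], float(similarity[article_index, i])) for i in top_idx]

# Usage (illustrative):
# similarity = build_similarity(article_vectors)
# recommend(similarity, article_index=0, titles=df["title"].tolist())
```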
Result and Insights:
The heatmap of the cosine similarity matrix, generated after aggressively removing common words and words with fewer than three characters, is shown above. Prior to this removal, the heatmap for the Word2Vec method was uniformly red, indicating high similarity across all articles. In contrast, the heatmap for the BERT method appeared more varied even without the removal of common words.
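For reference, a heatmap of this kind can be produced along these lines (assuming the `similarity` matrix from the sketch above; the styling choices are illustrative):

```python
# Sketch of the cosine-similarity heatmap.
# `similarity` is the matrix returned by build_similarity() above.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.heatmap(similarity, cmap="coolwarm", vmin=0, vmax=1)
plt.title("Pairwise cosine similarity between Medium articles")
plt.tight_layout()
plt.show()
```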
Conclusion:
By experimenting with different text preprocessing techniques and embedding methods, we were able to create a recommender system that provides diverse and relevant recommendations for Medium articles. We achieved this by leveraging both Word2Vec and BERT embeddings, combined with cosine similarity, to assess the similarities between articles.
An important aspect of our work was devising a method to determine the accuracy of our recommender system. We used computed columns such as vector magnitude, average similarity score, and the number of highly similar articles to evaluate the effectiveness of our recommendations. These metrics provided insights into the quality of the recommendations and helped validate our approach.
Additionally, by aggressively removing common words, we observed a significant improvement in the quality of recommendations.