Independently examined simple and multivariable regression, weights, bias, MSE, and gradient descent
Examined the weaknesses of one-hot encoding, regex, N-gram counting, and pretrained word vectors, as well as the implications of RNNs and LSTMs in the transition to attention-oriented systems
Strengthened understanding of Machine Learning fundamentals (through the webinars and NLP Basics series), such as deep learning, similarity, and attention models
Examined some of the applications of linear algebra and calculus in machine learning.
Studied and practiced web scraping, consulting the documentation of the specific libraries used
Studied how PyTorch stores embeddings as matrices through torch.nn.Embedding, and its application to N-gram language modeling and CBOW
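The regression fundamentals above (weights, bias, MSE, gradient descent) can be sketched in a few lines of NumPy. This is a minimal illustration, not project code; the data and hyperparameters are made up for the example.

```python
import numpy as np

# Fit y = w*x + b by minimizing MSE with batch gradient descent.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.05, size=100)  # true w=3.0, b=0.5

w, b = 0.0, 0.0  # weight and bias, initialized at zero
lr = 0.1         # learning rate
for _ in range(500):
    err = (w * x + b) - y
    # Gradients of MSE = mean(err**2) with respect to w and b
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # recovers values near w=3.0, b=0.5
```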
Tools
VS Code programming environment
BeautifulSoup4/Selenium/raw urllib for web scraping in Python
Git and GitHub Desktop for version control and organizing programs into repositories
Google Colab / Jupyter Notebooks for data visualization
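The BeautifulSoup4 parsing step above can be sketched as follows. This runs on an inline HTML snippet rather than a live page, and the `topic`/`title` class names are hypothetical; real selectors depend on the target site's markup.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a fetched page (urllib/Selenium would supply this).
html = """
<html><body>
  <div class="topic"><a class="title" href="/t/1">Engine noise</a></div>
  <div class="topic"><a class="title" href="/t/2">Brake pads</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selectors pull out the hypothetical topic links.
titles = [a.get_text() for a in soup.select("a.title")]
links = [a["href"] for a in soup.select("a.title")]
print(titles)  # ['Engine noise', 'Brake pads']
```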
Soft Skills
Coordinated group progress across the STEM-Away forum, Slack, Trello, WhatsApp, and Google Forms, strengthening communication skills
Developed logistical organization skills by categorizing announcements and changes
Read official documentation and public discussion boards in order to effectively debug programs
Examined the ethics of web scraping and the robots.txt files of different webpages in order to understand what can and cannot be scraped
Achievements
Established Discord server to augment Slack communication with more intuitive voice channel accessibility
Integrated Discord with team communication and consolidated member feedback through Google Form check-in
Studied the webinars, NLP Basics, and other materials to solidify understanding of Machine Learning
Installed required packages via the terminal and from source pages
Scraped Discourse forums with Python
Practiced implementation of Git/GitHub
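On the Discourse-scraping achievement: Discourse forums generally expose JSON endpoints (e.g. /latest.json, /t/&lt;topic_id&gt;.json) alongside their HTML pages, which simplifies scraping considerably. Below is a sketch of parsing such a response; the payload is a trimmed, hypothetical example of the /latest.json shape, not a real capture.

```python
import json

# Hypothetical, trimmed example of a Discourse /latest.json response body.
payload = json.loads("""
{
  "topic_list": {
    "topics": [
      {"id": 1, "title": "Engine noise", "posts_count": 4},
      {"id": 2, "title": "Brake pads", "posts_count": 2}
    ]
  }
}
""")

# Extract topic titles from the nested structure.
topics = payload["topic_list"]["topics"]
titles = [t["title"] for t in topics]
print(titles)  # ['Engine noise', 'Brake pads']
```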
Outcome
Implemented a more intuitive workspace for team members, established a clear picture of each member's progress, and gained critical insight into machine learning theory and application
Developed project management skills through logistical and technical means by thoroughly addressing team member questions on Slack/WhatsApp and directing members to appropriate resources
Ran four different classification models on the data: Naive Bayes, Decision Tree, Linear Support Vector Machine, and Logistic Regression
Refined combinations of data-cleaning methodologies for the combined CSV of scraped data (testing five different strategies, e.g. lowercasing + removal of special symbols)
Tested different feature selections and observed their respective effects on model accuracy → moved forward with the second feature-selection strategy (author + topic title + leading comment + other comments + tags), as it produced the highest overall accuracy (with Logistic Regression performing best in all cases)
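The model comparison above can be sketched with scikit-learn. This is an illustrative skeleton on a tiny made-up dataset, not the project pipeline: it applies one of the cleaning strategies (lowercasing + removal of special symbols) and then fits the same four model types.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def clean(text):
    """One cleaning strategy: lowercase, then strip special symbols."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

# Toy documents and labels standing in for the scraped forum data.
docs = ["My ENGINE rattles!!", "Brake pads squeal?", "Oil change overdue...",
        "New tires installed :)", "Transmission slipping!", "Wiper blades replaced"]
labels = ["repair", "repair", "maintenance",
          "maintenance", "repair", "maintenance"]

X = CountVectorizer().fit_transform(clean(d) for d in docs)

models = {
    "naive_bayes": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "linear_svm": LinearSVC(),
    "logistic_regression": LogisticRegression(),
}
# Training accuracy only, for brevity; real evaluation needs held-out data.
scores = {name: m.fit(X, labels).score(X, labels) for name, m in models.items()}
print(scores)
```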
Tools
VS Code
pandas, NumPy, NLTK, scikit-learn, etc.
Git + GitHub
Jupyter Notebook (individual code cell testing with .ipynb files in VSCode)
Soft Skills
Monitored and managed group across Slack, Trello, and Google Forms for presentation preparation / technical troubleshooting
Achievements
Trained basic ML models and recorded a variance of accuracy depending on data cleaning methodology and model type
Determined the most appropriate model/strategy for moving forward with the project
Tasks Completed
Implemented four separate classification models on the cartalk dataset
Identified the best data-cleaning method for optimizing model accuracy
Calculated F1 score, recall, and precision
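The F1 / recall / precision calculation above maps directly onto sklearn.metrics. A minimal sketch with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)  # 0.75 0.75 0.75 for this toy data
```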
Outcome
Streamlined project results and prepared data for further analysis
Augmented the initial four classifiers with an additional three → Random Forest, XGBoost, and LightGBM
Tested and recorded accuracy for the new models under each data-cleaning strategy / feature selection, with an emphasis on feature selection 2
Examined the implications/implementation of the BERT ML model
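Of the three added models, Random Forest can be sketched with scikit-learn alone; xgboost.XGBClassifier and lightgbm.LGBMClassifier follow the same fit/predict/score interface, so the snippet below extends to them directly (they are omitted here to keep the example dependency-free). Synthetic data stands in for the forum features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's feature matrix.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same pattern would apply to XGBClassifier / LGBMClassifier.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)  # held-out accuracy
print(acc)
```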
Tools
VS Code
pandas, NumPy, NLTK, scikit-learn, etc.
Homebrew → libomp (OpenMP runtime required by XGBoost/LightGBM on macOS)
Git + GitHub
Jupyter Notebook
Soft Skills
Strengthened project coordination abilities across platforms like Google Slides for presentation purposes
Helped troubleshoot minor technical difficulties / library implementation issues
Strengthened skills for independently studying new concepts
Achievements
Studied and implemented more complex classification models; developed tentative methodologies for optimizing model runtime in a Jupyter coding environment
Familiarized myself with the concept of BERT
Studied implementations of the XLNet, XLM, RoBERTa, and DistilBERT models
Tasks Completed
Added an additional three classification models to the NLP project → observed their implications through analysis of their respective accuracies
Identified and resolved runtime inefficiencies introduced by the increased complexity of the new models
Outcome
Gained important insight into how the separate approaches (model / cleaning strategy / feature selection) differ with regard to accuracy and data analysis
Recognized the importance of balanced data as opposed to imbalanced data
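The balanced-vs-imbalanced point can be made concrete with a tiny pure-Python example: on imbalanced labels, a baseline that always predicts the majority class looks accurate while never finding the minority class. The numbers are illustrative.

```python
# 95% majority class (0), 5% minority class (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # "always predict majority" baseline

# Overall accuracy looks strong despite the model being useless.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the minority class exposes the failure.
minority_recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy, minority_recall)  # 0.95 accuracy, 0.0 minority recall
```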